# Linear Regression with Python

USA_Housing.csv contains the following columns:

* 'Avg. Area Income': Average income of residents of the city the house is located in
* 'Avg. Area House Age': Average age of houses in the same city
* 'Avg. Area Number of Rooms': Average number of rooms for houses in the same city
* 'Avg. Area Number of Bedrooms': Average number of bedrooms for houses in the same city
* 'Area Population': Population of the city the house is located in
* 'Price': Price the house sold at
* 'Address': Address of the house

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np

In [None]:
housetable = pd.read_csv('../input/usa-housing/USA_Housing.csv')


In [None]:
housetable.head()

In [None]:
housetable.describe()

In [None]:
housetable.info()

In [None]:
housetable.columns

# EDA

Let's create some simple plots to check out the data!

In [None]:
sns.pairplot(housetable, palette="husl", markers='^')

In [None]:
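# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(housetable['Price'], kde=True, color='m')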

In [None]:
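# 'Address' is a text column, so restrict the correlation matrix to numeric columns
sns.heatmap(housetable.select_dtypes('number').corr(), cmap="viridis", annot=True)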

In [None]:
# Features: every column except the last two ('Price' is the target, 'Address' is text)
X = housetable[housetable.columns[:-2]]


In [None]:
y = housetable['Price']

## Train Test Split

Let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
X_test.shape
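
To make the effect of `test_size` concrete, here is a minimal sketch on made-up data (the names `X_demo` and `y_demo` are hypothetical and are not part of the housing dataset). With 8 rows and `test_size=0.25`, 2 rows should land in the test set and 6 in the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows, 2 features (not the housing data)
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)

# Same arguments as the housing split above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101
)

print(X_tr.shape, X_te.shape)  # 6 training rows, 2 test rows
print(y_tr.shape, y_te.shape)  # targets are split the same way
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same split.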

## Creating and Training the Model

In [None]:
from sklearn.linear_model import LinearRegression


In [None]:
line = LinearRegression()

In [None]:
line.fit(X_train,y_train)

## Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

In [None]:
# print the intercept
print(line.intercept_)

In [None]:
coeff_df = pd.DataFrame(line.coef_, index=X.columns, columns=['Coefficient'])
coeff_df
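
One way to read a coefficient: holding all other features fixed, a one-unit increase in that feature changes the predicted value by the coefficient. The sketch below checks this on synthetic, noiseless data; the feature names `x1` and `x2` and the true weights (3, 5, intercept 10) are assumptions chosen for illustration, not values from the housing model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 10 + 3*x1 + 5*x2, with no noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
y_demo = 10 + 3 * X_demo["x1"] + 5 * X_demo["x2"]

model = LinearRegression().fit(X_demo, y_demo)

# Increase x1 by one unit while keeping x2 fixed
base = pd.DataFrame({"x1": [1.0], "x2": [2.0]})
bumped = base.assign(x1=base["x1"] + 1)
delta = model.predict(bumped)[0] - model.predict(base)[0]

print(model.coef_, model.intercept_)  # should be close to [3, 5] and 10
print(delta)                          # should match the x1 coefficient
```

The same reading applies to `coeff_df`: each row gives the change in predicted price for a one-unit increase in that column, with the other columns held constant.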

## Predictions from our Model

Let's grab predictions off our test set and see how well it did!

In [None]:
predictions = line.predict(X_test)

In [None]:
plt.scatter(y_test,predictions, c='g',marker='.')

In [None]:
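plt.figure(figsize=(12, 4))
# Histogram of residuals (actual minus predicted); histplot replaces the deprecated distplot
sns.histplot(y_test - predictions, bins=50, kde=True)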


## Regression Evaluation Metrics

The three metrics defined in this section compare as follows:

- **MAE** is the easiest to understand, because it is the average absolute error.
- **MSE** is often preferred over MAE, because squaring "punishes" larger errors more heavily, which tends to be useful in the real world.
- **RMSE** is often preferred over MSE, because it is expressed in the same units as "y".

All of these are **loss functions**, because we want to minimize them.
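
To illustrate how the squared metrics "punish" large errors, here is a small sketch with two made-up error vectors (illustrative numbers only, not housing residuals). Both have the same MAE, but the one containing a single large error has a much higher MSE and RMSE.

```python
import numpy as np

# Two hypothetical sets of errors with the same mean absolute error
errors_even = np.array([10.0, 10.0, 10.0, 10.0])
errors_outlier = np.array([1.0, 1.0, 1.0, 37.0])

def mae(e):
    return np.mean(np.abs(e))

def mse(e):
    return np.mean(e ** 2)

def rmse(e):
    return np.sqrt(mse(e))

print("MAE :", mae(errors_even), mae(errors_outlier))    # both 10.0
print("MSE :", mse(errors_even), mse(errors_outlier))    # 100.0 vs 343.0
print("RMSE:", rmse(errors_even), rmse(errors_outlier))  # the outlier case is larger
```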

In [None]:
from sklearn import metrics

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))


**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

In [None]:
print('MSE:', metrics.mean_squared_error(y_test, predictions))


**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

In [None]:
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
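
As a sanity check, the three formulas can be written directly in NumPy and compared with scikit-learn's helpers. The arrays below are made-up values used only for illustration; they are not taken from the housing predictions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Direct translations of the formulas
mae_manual = np.mean(np.abs(y_true - y_pred))
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

print("MAE :", mae_manual, metrics.mean_absolute_error(y_true, y_pred))
print("MSE :", mse_manual, metrics.mean_squared_error(y_true, y_pred))
print("RMSE:", rmse_manual)
```

Each manual value should agree with the corresponding scikit-learn result up to floating-point rounding.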