## Machine Learning with Python

### 1. Linear Regression model

In [None]:
import pandas as pd 
import numpy as np

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

In [None]:
df = pd.read_csv("USA_Housing.csv")

In [None]:
df.head()

In [None]:
df.info()  # shows general information about the DataFrame:
# RangeIndex (number of rows), number of columns,
# column labels (names), the non-null count of each column,
# column data types, and memory usage

In [None]:
df.describe()  # computes summary statistics for the DataFrame columns.
# By default the summary covers only numeric columns (text columns
# are excluded).

In [None]:
df.columns

In [None]:
sns.pairplot(df)

In [None]:
sns.histplot(df['Price'], kde=True)  # check the distribution of the target we are trying to predict

In [None]:
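df.corr(numeric_only=True)  # correlation between all numeric columns
# numeric_only=True skips text columns such as Address; pandas 2.0+ raises an error without it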

In [None]:
sns.heatmap(df.corr(numeric_only=True))  # plot the correlation matrix of the numeric columns as a heatmap

In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True)  # annot=True writes each correlation
                                                     # value inside its cell

#### Train a Linear Regression model with scikit-learn

**Linear Regression formula**
$$\hat{y} = \hat{w}_0 x_0 + \hat{w}_1 x_1 + \dots + \hat{w}_n x_n + b$$
This is the multi-feature analogue of the single-feature line
$$y = mx + b$$
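
As a rough illustration of the formula, the prediction is a dot product of the weights with the feature values plus the bias. The weights, bias, and feature values below are made-up numbers chosen for easy arithmetic, not values from the housing data.

```python
import numpy as np

# Hypothetical weights w_0..w_2 and bias b (illustrative values only)
w = np.array([2.0, 0.5, -1.0])
b = 4.0

# One sample with three feature values x_0..x_2
x = np.array([3.0, 10.0, 2.0])
y_hat = np.dot(w, x) + b  # 2*3 + 0.5*10 - 1*2 + 4 = 13.0
print(y_hat)

# Several samples at once: one row per sample, one column per feature
X_rows = np.array([[3.0, 10.0, 2.0],
                   [0.0, 0.0, 0.0]])
y_hat_all = X_rows @ w + b  # second row is all zeros, so its prediction equals b
print(y_hat_all)
```

With a single feature this reduces to $y = mx + b$, where $m$ plays the role of the one weight.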

In [None]:
# Prepare the X array (features used to train the model) and the y array (the target variable).
# The target variable in this case is the Price column.

In [None]:
# Exclude the Address column from X (the features) because the Linear Regression model
# can't use raw text information.

In [None]:
df.columns

In [None]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]

In [None]:
y = df['Price']

In [None]:
#Split our dataset
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
# test_size=0.4 allocates 40% of the dataset to the test set
# random_state=101 fixes the seed so the same random split is produced every time
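
The effect of `test_size` and `random_state` can be sketched on a tiny made-up array (the `X_demo` and `y_demo` names and values are illustrative assumptions, not the housing data). The split sizes follow the 60/40 ratio, rows of X stay paired with their y values, and repeating the call with the same seed gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features; the first feature of row i is 2*i and y is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
print(len(X_tr), len(X_te))  # 6 training rows and 4 test rows

# Each test row is still paired with its own target value
rows_paired = bool(np.all(X_te[:, 0] == 2 * y_te))

# Same seed, same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
same_split = np.array_equal(X_te, X_te_again)
print(rows_paired, same_split)
```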

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Instantiate the LinearRegression model
lm = LinearRegression()  # creates a LinearRegression object

In [None]:
# Train (fit) the LinearRegression model on the training data
lm.fit(X_train, y_train)

In [None]:
# Evaluate the model by checking its coefficients and seeing how to interpret them
print(lm.intercept_)  # y-intercept (b), the constant bias term

In [None]:
lm.coef_  # one coefficient per feature of the dataset:
# w0, w1, w2, w3 and w4

In [None]:
X_train.columns

##### Now create a DataFrame from these coefficients in order to organize this data better.
##### DataFrame format:
`pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)`


In [None]:
cdf = pd.DataFrame(lm.coef_, X.columns, columns=['Coeff'])  # cdf = coefficient DataFrame

In [None]:
cdf
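
One way to read such a coefficient table: each coefficient is the change in the predicted target for a one-unit increase in that feature while the other features stay fixed. The sketch below illustrates this on synthetic, noise-free data with hypothetical feature names (`rooms`, `age`) and known weights, so the fitted coefficients can be checked against the values used to build the target. It does not use the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical features (not the columns of USA_Housing.csv)
X_demo = pd.DataFrame({
    "rooms": rng.uniform(3, 9, size=200),
    "age": rng.uniform(1, 40, size=200),
})
# Noise-free target built from known weights 10 and -2 and intercept 50
y_demo = 10.0 * X_demo["rooms"] - 2.0 * X_demo["age"] + 50.0

demo_lm = LinearRegression().fit(X_demo, y_demo)
demo_cdf = pd.DataFrame(demo_lm.coef_, X_demo.columns, columns=["Coeff"])
print(demo_cdf)
print("intercept:", demo_lm.intercept_)

# Raise "rooms" by one unit with "age" unchanged:
# the prediction moves by the "rooms" coefficient
base = pd.DataFrame({"rooms": [5.0], "age": [20.0]})
bumped = pd.DataFrame({"rooms": [6.0], "age": [20.0]})
delta = demo_lm.predict(bumped)[0] - demo_lm.predict(base)[0]
print("change in prediction:", delta)
```

On real data the recovered coefficients are estimates, and their size depends on the units of each feature, so coefficients of features measured on different scales are not directly comparable.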

#### Predictions

In [None]:
predictions = lm.predict(X_test)

In [None]:
predictions  # the predicted prices of the houses in the test set

In [None]:
y_test  # the actual (correct) prices of the houses in the test set

##### Now we can compare our predictions with the correct price (y_test)

We can quickly analyze this by plotting predictions against y_test.

In [None]:
plt.scatter(y_test, predictions)

The points fall close to a straight line, which shows that the predictions track the actual prices (y_test) closely, so the model fits this data well.

Now let's create a histogram of the residuals (difference between y_test and predictions)

In [None]:
sns.histplot(y_test-predictions, kde=True)

The histogram above shows that the residuals are approximately normally distributed, which suggests that linear regression is a reasonable choice for this data.
If the residuals were clearly not normally distributed, we should check whether a linear regression model is the right choice for our dataset.
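
A histogram is a visual check. As a rough numeric complement, the sketch below fits a model on synthetic data (a linear signal plus Gaussian noise, assumed only for illustration) and summarizes the residuals with their mean, skewness, and a Shapiro-Wilk test from `scipy.stats`. A very small p-value would suggest the residuals depart from normality. A large one does not prove normality, it only fails to detect a departure.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: linear signal plus Gaussian noise (illustrative assumption)
X_demo = rng.uniform(0, 10, size=(500, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(0, 1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101
)
model = LinearRegression().fit(X_tr, y_tr)
residuals = y_te - model.predict(X_te)

print("mean residual:", residuals.mean())        # should be close to 0
print("skewness:", stats.skew(residuals))        # near 0 for a symmetric shape
stat, p_value = stats.shapiro(residuals)         # Shapiro-Wilk normality test
print("Shapiro-Wilk p-value:", p_value)
```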

## Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
where $y_i$ is the true value of y and $\hat{y}_i$ is the predicted value.

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world. However, MSE is expressed in squared y units, e.g. $\$^2$.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units (taking the square root undoes the squaring of the units).

All of these are **loss functions**, because we want to minimize them.
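
The three formulas can be worked through by hand on a few made-up numbers (the `y_true` and `y_pred` values below are illustrative, not housing prices):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 3.0])

errors = y_true - y_pred        # [1, 0, -2, 4]
mae = np.mean(np.abs(errors))   # (1 + 0 + 2 + 4) / 4 = 1.75
mse = np.mean(errors ** 2)      # (1 + 0 + 4 + 16) / 4 = 5.25
rmse = np.sqrt(mse)             # about 2.29, back in the units of y

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

The single error of 4 accounts for 16 of the 21 units of total squared error but only 4 of the 7 units of total absolute error, which is the sense in which MSE and RMSE penalize large errors more heavily than MAE.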

#### How to calculate Regression Evaluation Metrics with sklearn

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

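##### We can do the above data analysis on a real housing data set called the Boston data set
###### How to load the Boston data set from sklearn
Here I only show how to load and check the data (the analysis part is not done). Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below run only on older versions.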

In [None]:
from sklearn.datasets import load_boston  # requires scikit-learn < 1.2 (load_boston was removed in 1.2)

In [None]:
boston = load_boston()

In [None]:
boston.keys()

In [None]:
#print(boston['DESCR']) # to check the description

In [None]:
#print(boston['data']) # to check the data

In [None]:
#print(boston['feature_names']) # in a dataframe these are column names

In [None]:
#print(boston['target']) # shows the target values (median home prices in thousands of dollars)
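
On scikit-learn versions where `load_boston` is no longer available, the same load-and-inspect steps can be sketched with another bundled dataset. The example below uses `load_diabetes`, which is not housing data and is swapped in here only because it ships with scikit-learn and needs no download. For housing data, `fetch_california_housing` is an option, but it downloads the data on first use.

```python
from sklearn.datasets import load_diabetes

# Stand-in dataset (an assumption for illustration; not the Boston data)
bunch = load_diabetes()

print(bunch.keys())              # dictionary-like object with data, target, DESCR, ...
print(bunch["data"].shape)       # (n_samples, n_features)
print(bunch["feature_names"])    # in a DataFrame these would be the column names
print(bunch["target"][:5])       # first few target values
# print(bunch["DESCR"])          # uncomment to read the full description
```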