# Useful Python Code for Machine Learning
This notebook is a collection of scripts for training machine learning models, kept for my own reference.
<br>
It is based on the Udemy course "Python for Data Science and Machine Learning Bootcamp" by Jose Portilla <br>
https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp

In [None]:
# USEFUL LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning
# sklearn
# tensorflow
# keras

# for linear regression, logistic regression, knn
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# for splitting data into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import train_test_split

# for calculating MAE, MSE, RMSE (for linear regression)
from sklearn import metrics

# for getting the precision and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# see visualisation in notebook
%matplotlib inline

# Data Exploration

 Useful article
 https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

 In data exploration, we want to:
 - find the correlation between the columns
 - check whether any data is missing
 - if data is missing, see whether you can impute (fill) the missing values from other columns that correlate well with them, then substitute the missing values using a suitable method
 - otherwise, drop the affected rows or columns
 - check for outliers
 - if there are outliers, find out why, then decide whether to drop them
 - ** if an outlier was entered incorrectly: drop it
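
The checklist above can be sketched on a toy table. This is only an illustration under stated assumptions: the frame `df_demo`, its columns (`pclass`, `age`, `fare`), the group-mean imputation, and the 1.5 * IQR outlier rule are all hypothetical choices, not something the course prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' has one missing value, 'fare' has one extreme value
df_demo = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3],
    'age': [38.0, 40.0, 30.0, np.nan, 22.0, 24.0],
    'fare': [80.0, 75.0, 20.0, 25.0, 8.0, 500.0],
})

# count the missing values in each column
missing_counts = df_demo.isnull().sum()

# impute 'age' with the mean age of the rows that share the same 'pclass'
# (assumes 'pclass' is a column that correlates well with 'age')
df_demo['age'] = df_demo['age'].fillna(
    df_demo.groupby('pclass')['age'].transform('mean')
)

# flag outliers in 'fare' with the 1.5 * IQR rule (one common convention)
q1, q3 = df_demo['fare'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df_demo['fare'] < q1 - 1.5 * iqr) | (df_demo['fare'] > q3 + 1.5 * iqr)

print(missing_counts)
print(df_demo[is_outlier])
```

Whether a flagged row should actually be dropped still depends on why it is extreme, as the list above says.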

In [None]:
# DATA INFO
# df = data
# to get the info of the data
df.info()

# to show data columns
df.columns

# to show descriptive stats of each column (eg: mean, std_dev, min, max etc)
df.describe()

# to show the correlation of the data (which columns have a strong relationship with each other)
# on pandas >= 2.0, pass numeric_only=True if df has non-numeric columns
df.corr()

In [None]:
# VISUALIZATIONS

# to turn the seaborn plot style to whitegrid
sns.set_style('whitegrid')

# PAIRPLOT: to see histograms of all the columns and correlation of scatterplots
sns.pairplot(df)

# to see histograms of all the columns and correlation, color code in a column
sns.pairplot(df, hue='*column_name*')

# HEATMAP: to show the correlation in heatmap
sns.heatmap(df.corr(), annot=True)

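# HISTOGRAM: to check out the distribution & the frequency of a column
# ** bins = range of values divided into xx equal parts
# ** kde=True overlays a kernel density estimate; leave it out for a plain histogram
# (sns.distplot is deprecated since seaborn 0.11; histplot is its replacement)
sns.histplot(df['*column_name*'], bins=30, kde=True)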

# another way to plot distribution
df['*column_name*'].plot.hist(bins=30)

# to increase the size of the plot, use figsize = ()
# plt.figure(figsize=(10,7)) or
df['*column_name*'].hist(bins=30, figsize=(10,4))

# BOXPLOT: useful for comparing distributions across groups & checking for outliers
plt.figure(figsize=(10,7))
sns.boxplot(x='*x_column_name*',y='*y_column_name*', data=df)


# CHECK MISSING DATA
# to check for missing data, plot a heatmap of the null values; yellow strips = null values
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap ='viridis')

# to count the rows in each category of a feature, use countplot; to split the counts by another feature, pass it as hue
# eg: count the number of Titanic survivors by gender
# sns.countplot(x='Survived', hue='Sex', data=df, palette='RdBu_r')
sns.countplot(x='*column_name*', hue='*category_column_name*', data=df, palette='*color_name*')

In [None]:
# for interactive plot, you can use cufflinks
import cufflinks as cf
cf.go_offline()

# eg: histogram interactive plot
df['*column_name*'].iplot(kind='hist', bins=30)

# Linear Regression
- Used when the dependent variable (the variable you want to predict, y) is numeric
- A linear approach to modelling the relationship between a scalar response and one or more explanatory variables
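
As a rough end-to-end illustration of the two points above, the sketch below fits a linear model to synthetic data. Everything here is an assumption made for the example: the names `X_demo`, `y_demo` and `lm_demo`, the true coefficients (3, -2) and intercept (5), and the noise level.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): y depends linearly on two features plus small noise
rng = np.random.default_rng(101)
X_demo = pd.DataFrame({
    'x1': rng.uniform(0, 10, 200),
    'x2': rng.uniform(0, 10, 200),
})
y_demo = 3.0 * X_demo['x1'] - 2.0 * X_demo['x2'] + 5.0 + rng.normal(0, 0.1, 200)

# split, fit, and predict
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
lm_demo = LinearRegression()
lm_demo.fit(X_train, y_train)

# the fitted coefficients should land close to the true values 3 and -2
coef_table = pd.DataFrame(lm_demo.coef_, X_demo.columns, columns=['Coeff'])
print(coef_table)
print('intercept:', lm_demo.intercept_)

# residuals on the test set should be centered near 0
residuals = y_test - lm_demo.predict(X_test)
print('mean residual:', residuals.mean())
```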

In [None]:
# NOMINATE X AND Y
# X = feature data used to predict y
# y = data you want to predict
X = df[['*feature_column_1*', '*feature_column_2*']]
y = df['*y_column_name*']

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [None]:
# import linear regression model and create it
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

In [None]:
# TRAIN AND TEST THE MODEL
# train the model 
lm.fit(X_train, y_train)

# predict the testing sets
predict = lm.predict(X_test)


In [None]:
# VISUALIZE THE RESULTS
# to compare the model's predictions with the actual values, plot a scatter
# if the points fall close to a straight line = quite a good model
plt.scatter(y_test, predict)

# to check out the residuals, you can plot a histogram of their distribution
# (sns.distplot is deprecated since seaborn 0.11; histplot is its replacement)
sns.histplot(y_test - predict, kde=True)

# ** RESIDUALS: difference between the actual values of y_test and the predicted values;
# each one measures how well the fitted line matches an individual data point. In the distribution plot
# we want the residuals centered on 0 and as close to 0 as possible

In [None]:
# MODEL RESULTS: COEFFICIENTS
# to check out the coefficient of each features
lm.coef_

# to make the coefficient into a table
cdf = pd.DataFrame(lm.coef_, X.columns, columns=['Coeff'])
cdf

# holding the other features fixed, a 1 unit increase in a feature is associated with
# a change in the predicted value equal to that feature's coefficient

In [None]:
# MODEL RESULTS: EVALUATION METRICS
# Evaluate metrics, basically: 
# - Mean Absolute Error (MAE)
# - Mean Squared Error (MSE)
# - Root Mean Squared Error (RMSE)
# You want to minimize all of them, the smaller number the better
from sklearn import metrics

# Mean absolute error
MAE = metrics.mean_absolute_error(y_test, predict)

# Mean Squared Error (MSE)
MSE = metrics.mean_squared_error(y_test, predict)

# Root Mean Squared Error (RMSE)
RMSE = np.sqrt(MSE)
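
To make the three metrics concrete, here is a small hand calculation with NumPy. The arrays `y_true` and `y_pred` are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # [1, 0, -2, -1]

mae = np.mean(np.abs(errors))     # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)        # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)               # sqrt(1.5), about 1.22

print(mae, mse, rmse)
```

MSE penalizes the single error of 2 more heavily than MAE does, and RMSE brings the result back to the units of y.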

# Logistic Regression
- Used when the dependent variable is categorical (classification problem)
- Used to predict the odds of being a case based on the values of the independent variables (predictors, X)
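
The link between predicted probabilities and odds can be sketched as follows. The toy arrays `X_demo` and `y_demo` and the name `logmodel_demo` are hypothetical; the point is only that odds = p / (1 - p) and that the log-odds are what the model computes linearly from X.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, class 1 becomes more likely as x grows
X_demo = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

logmodel_demo = LogisticRegression()
logmodel_demo.fit(X_demo, y_demo)

# predicted probability of class 1 for each row
p = logmodel_demo.predict_proba(X_demo)[:, 1]

# odds of being a case (class 1)
odds = p / (1 - p)

# the log-odds are linear in X: intercept + coefficient * x
log_odds = logmodel_demo.decision_function(X_demo)

print(np.round(p, 3))
print(np.round(odds, 3))
```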

In [None]:
# convert categorical features using pandas' get_dummies
# eg: turn Female & Male into 0 & 1 so that the ML algorithm can use the categorical feature
# drop_first=True drops the 1st dummy column, since the remaining column(s) already encode it
# eg: male column, 0 = female, 1 = male.
cat_feat = pd.get_dummies(df['*column_name_to_transform*'], drop_first=True)

# concatenate the dummy columns with the original data
df = pd.concat([df, cat_feat], axis=1)

In [None]:
# NOMINATE X AND Y
X = df[['*feature_column_1*', '*feature_column_2*']] # only numerical feature columns
y = df['*y_column_name*'] # data you want to predict

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [None]:
# import logistic regression model and create it
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()

In [None]:
# TRAIN AND TEST THE MODEL
# train the model 
logmodel.fit(X_train, y_train)

# predict the testing sets
predict = logmodel.predict(X_test)

In [None]:
# MODEL RESULTS: EVALUATE THE MODEL
# using classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# show classification report
print(classification_report(y_test, predict))

# show confusion matrix
print(confusion_matrix(y_test, predict))

Confusion Matrix: <br>
TN, FP <br>
FN, TP <br>
TN = true negative, predict = no, actual = no <br>
FP = false positive, predict = yes, actual = no <br>
FN = false negative, predict = no, actual = yes <br>
TP = true positive, predict = yes, actual = yes <br>
<br>
You want high TN and TP counts and low FP and FN counts.

<br>
In the classification report, you want high precision (close to 1); the report also lists recall and f1-score, which are read the same way.
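
The layout above can be checked on a small example. The labels in `y_true` and `y_pred` are made up; for a binary problem with labels 0 and 1, scikit-learn's `confusion_matrix` returns the counts in the order TN, FP, FN, TP when flattened.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (0 = no, 1 = yes)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

# flatten the 2x2 matrix into its four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                   # of the predicted "yes", how many were right
recall = tp / (tp + fn)                      # of the actual "yes", how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were right

print(tn, fp, fn, tp)
print(precision, recall, accuracy)
```

Here TN = 3, FP = 2, FN = 1 and TP = 4, so precision is 4/6, recall is 4/5 and accuracy is 7/10.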
   

# K Nearest Neighbours
- Supervised Technique
- Used for classification or regression on labelled data, where the target variable is usually known beforehand

Note: Because the KNN classifier predicts the class of a test observation by identifying the observations nearest to it, the scale of the variables matters a lot: any variable on a large scale has a much larger effect on the distance between observations. So when you use KNN, standardize all the features to the same scale.
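
A small sketch of why scale matters, using a made-up table `df_demo` with an `income` column (large scale) and an `age` column (small scale). Both column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
df_demo = pd.DataFrame({
    'income': [30000.0, 52000.0, 75000.0, 41000.0],
    'age': [25.0, 40.0, 58.0, 33.0],
})
raw = df_demo.to_numpy()

# Euclidean distance between the first two rows before scaling:
# sqrt(22000^2 + 15^2), about 22000, almost entirely due to income
raw_dist = np.linalg.norm(raw[0] - raw[1])

# after standardizing, every column has mean 0 and standard deviation 1,
# so both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(df_demo)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])

print(raw_dist, scaled_dist)
print(scaled.mean(axis=0), scaled.std(axis=0))
```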

In [None]:
# SCALE THE FEATURED DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit the scaler to the data (computes each column's mean and standard deviation, which it later uses in its transform operation)
# remember to drop the column you want to predict
df_notc = df.drop('*y_column_name*', axis=1)
scaler.fit(df_notc)

# use the scaler to do the transformation
scaled_feature = scaler.transform(df_notc)

# create a table for the scaled features
df_feat = pd.DataFrame(scaled_feature, columns=df_notc.columns)

In [None]:
# NOMINATE X AND Y
X = df_feat # standardized feature data
y = df['*y_column_name*'] # data you want to predict

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [None]:
# import KNN model and create it
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

n_neighbors=1 means k=1 <br>
When given a new example, the model looks through the training data and finds the k training examples closest to it; k is therefore the number of neighbors considered. <br>
eg: with k=2, the 2 closest neighbors are used to smooth the estimate at a given point
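
To see which training examples the model actually consults, scikit-learn's `kneighbors` method returns the distances to, and indices of, the k nearest training points. The toy points in `X_demo`, the labels in `y_demo`, and the choice of k=3 are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two well-separated clusters
X_demo = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
    [6.0, 6.0], [7.0, 6.5], [6.5, 7.0],   # class 1
])
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(X_demo, y_demo)

# look up the 3 closest training points for a new example
new_point = np.array([[2.0, 2.0]])
distances, indices = knn_demo.kneighbors(new_point)

# the predicted class is the majority class among those neighbors
prediction = knn_demo.predict(new_point)

print(indices[0], y_demo[indices[0]], prediction)
```

All three neighbors of the new point belong to class 0, so the prediction is class 0.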

In [None]:
# TRAIN AND TEST THE MODEL
# train the model 
knn.fit(X_train, y_train)

# predict the testing sets
predict = knn.predict(X_test)

In [None]:
# MODEL RESULTS: EVALUATE THE MODEL
# using classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# show classification report
print(classification_report(y_test, predict))

# show confusion matrix
print(confusion_matrix(y_test, predict))

This shows the results for k=1. If the results aren't good, or if you want to improve them, use the elbow method (a for loop over k) to pick a better k value.

In [None]:
# Elbow method
error_rate = []

for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    
    # append the error rate into the list
    error_rate.append(np.mean(pred_i != y_test))


In [None]:
# plot the error rate
plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate,color='blue', linestyle='--',marker='o',markersize=10,markerfacecolor='red')
plt.title('Error Rate vs K Value')
plt.xlabel('k')
plt.ylabel('Error Rate')

Once you've plotted it, check which k value gives the lowest error rate. Prefer a k whose neighboring values (just before and after it) also produce consistently low error rates.
<br>
Once you've picked the k value, re-run the model and check the classification report and confusion matrix to see whether the results have improved.
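
The selection step can also be done programmatically. The sketch below is self-contained and runs on synthetic data from `make_classification`; the dataset, the variable names, and the use of a plain `argmin` are assumptions. The lowest single error rate is only a starting point, since a k in a stable low-error region may be the better pick, as noted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (hypothetical stand-in for the scaled features)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# error rate on the test set for each k
k_values = range(1, 40)
error_rate = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    error_rate.append(np.mean(knn.predict(X_test) != y_test))

# k with the lowest error rate (the first one, if several tie)
best_k = k_values[int(np.argmin(error_rate))]

# re-run the model with the chosen k
knn_best = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
best_error = np.mean(knn_best.predict(X_test) != y_test)

print(best_k, best_error)
```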