# Useful Python Code for Machine Learning
This notebook collects reference scripts for training machine learning models.
<br>
It is based on the Udemy course "Python for Data Science and Machine Learning Bootcamp" by Jose Portilla <br>
https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp

In [None]:
# USEFUL LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning
# sklearn
# tensorflow
# keras

# for linear regression, logistic regression, knn
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# for splitting data into training and testing sets
#from sklearn.cross_validation import train_test_split #old version
from sklearn.model_selection import train_test_split #version 0.20 onwards

# for calculating MAE, MSE, RMSE (for linear regression)
from sklearn import metrics

# for getting the precision and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# see visualisation in notebook
%matplotlib inline

# Data Exploration

 Useful article
 https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

 In data exploration, we want to
 - find the correlation between columns
 - check whether there's any missing data
 - if there's missing data, see whether you can impute (fill) the missing values: find other columns that correlate well with the incomplete one, then use them (or another method) to substitute the missing values
 - otherwise, drop it
 - check whether there are any outliers
 - if there are outliers, find out why, and decide whether to drop them
 - ** if an outlier was entered incorrectly: drop it
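
The checklist above can be sketched on a tiny made-up table. Everything here is an assumption for illustration: the `toy` DataFrame, its column names, the choice of a per-group median for imputation, and the 1.5 * IQR rule for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' has one missing value, 'fare' has one extreme value.
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3],
    'age': [38.0, 42.0, 30.0, np.nan, 22.0, 26.0],
    'fare': [80.0, 75.0, 20.0, 22.0, 8.0, 500.0],
})

# Count the missing values per column (taken before any imputation).
missing_counts = toy.isnull().sum()

# Impute 'age' with the median age of rows in the same 'pclass' group.
toy['age'] = toy['age'].fillna(toy.groupby('pclass')['age'].transform('median'))

# Flag outliers in 'fare' with the 1.5 * IQR rule.
q1, q3 = toy['fare'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy['fare'] < q1 - 1.5 * iqr) | (toy['fare'] > q3 + 1.5 * iqr)]

print(missing_counts)
print(outliers)
```

Whether a flagged row should actually be dropped still depends on why the value is extreme, as noted in the list.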

In [None]:
# DATA INFO
# df = data
# to get the info of the data
df.info()

# to show data columns
df.columns

# to show descriptive stats of each column (eg: mean, std_dev, min, max etc)
df.describe()

# to show the correlation of the data (which column has a good relationship with other column)
df.corr()

In [None]:
# VISUALIZATIONS

# to turn the seaborn plot style to whitegrid
sns.set_style('whitegrid')

# PAIRPLOT: to see histograms of all the columns and correlation of scatterplots
sns.pairplot(df)

# to see histograms of all the columns and correlation, color code in a column
sns.pairplot(df, hue='*column_name*')

# HEATMAP: to show the correlation in heatmap
sns.heatmap(df.corr(), annot=True)

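# HISTOGRAM: to check out the distribution & the frequency of a column
# ** bins = range of values divided into xx equal parts
# ** kde=True overlays a density curve; leave it out (default kde=False) to turn it off
# ** sns.distplot is deprecated since seaborn 0.11, histplot replaces it
sns.histplot(df['*column_name*'], bins=30, kde=True)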

# another way to plot distribution
df['*column_name*'].plot.hist(bins=30)

# to increase the size of the plot, use figsize = ()
# plt.figure(figsize=(10,7)) or
df['*column_name*'].hist(bins=30, figsize=(10,4))

# BOXPLOT: useful for comparing distributions across groups & checkout outliers
plt.figure(figsize=(10,7))
sns.boxplot(x='*x_column_name*',y='*y_column_name*', data=df)


# CHECK MISSING DATA
# to check missing data, use a heatmap of df.isnull(); yellow strips = null values
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap ='viridis')

# to count a feature, use countplot; to split the counts by another feature, pass it as hue
# eg: count the number of titanic survivor by gender
# sns.countplot(x='Survived', hue='Sex', data=df, palette='RdBu_r')
sns.countplot(x='*column_name*', hue='*category_column_name*', data=df, palette='*color_name*')

In [None]:
# for interactive plot, you can use cufflinks
import cufflinks as cf
cf.go_offline()

# eg: histogram interactive plot
df['*column_name*'].iplot(kind='hist', bins=30)

# Linear Regression
- Used when the dependent variable (the variable you want to predict,y) is numeric
- linear approach to modelling the relationship between a scalar response and >= 1 explanatory variables
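
For a version that runs without a real dataset, here is a minimal sketch on synthetic data. The data, the "true" relationship `y = 3*x1 - 2*x2 + 5` and the `_demo` names are all made up for illustration; the point is only that the fitted coefficients should land close to the values used to generate the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(101)

# Hypothetical features and target: y = 3*x1 - 2*x2 + 5 plus a little noise
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

lm_demo = LinearRegression().fit(X_tr, y_tr)
r2 = lm_demo.score(X_te, y_te)  # R^2 on the held-out test set

print(lm_demo.coef_, lm_demo.intercept_, r2)
```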

In [None]:
# NOMINATE X AND Y
# X = feature data used to predict y
# y = data you want to predict
X = df[['*feature_column_1*', '*feature_column_2*']]
y = df['*y_column_name*']

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in version 0.20
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [None]:
# import linear regression model and create it
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

In [None]:
# TRAIN AND TEST THE MODEL
# train the model 
lm.fit(X_train, y_train)

# predict the testing sets with the model
predict = lm.predict(X_test)


In [None]:
# VISUALIZE THE RESULTS
# to compare the predictions with the actual values, plot a scatter
# if the points fall close to a straight line = quite a good model
plt.scatter(y_test, predict)

# to check out the residuals, you can plot a histogram of their distribution
# (sns.distplot is deprecated since seaborn 0.11, histplot replaces it)
sns.histplot(y_test - predict, kde=True)

# ** RESIDUALS: difference between the actual values of y_test and the predicted values,
# a measure of how well the fitted line matches each actual data point; in the distribution plot
# we want the residuals centred on 0 and as close to 0 as possible

In [None]:
# MODEL RESULTS: COEFFICIENTS
# to check out the coefficient of each features
lm.coef_

# to make the coefficient into a table
cdf = pd.DataFrame(lm.coef_, X.columns, columns=['Coeff'])
cdf

# a coefficient tells you that, holding the other features fixed, a 1 unit increase in that feature
# is associated with a change of that coefficient's value in the predicted value

In [None]:
# MODEL RESULTS: EVALUATION METRICS
# Evaluate metrics, basically: 
# - Mean Absolute Error (MAE)
# - Mean Squared Error (MSE)
# - Root Mean Squared Error (RMSE)
# You want to minimize all of them, the smaller number the better
from sklearn import metrics

# Mean absolute error
MAE = metrics.mean_absolute_error(y_test, predict)

# Mean Squared Error (MSE)
MSE = metrics.mean_squared_error(y_test, predict)

# Root Mean Squared Error (RMSE)
RMSE = np.sqrt(MSE)

# Logistic Regression
- used when the dependent variable is categorical (classification problem)
- used to predict the odds of being a case based on the values of the independent variables (predictors,X)
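
To make "the odds of being a case" concrete: the model turns a linear score `z = b0 + b1*x1 + ...` into a probability with the sigmoid function, and the odds `p / (1 - p)` then equal `exp(z)`. A small sketch with made-up scores (the values in `z` are hypothetical):

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])  # hypothetical linear scores
p = sigmoid(z)                  # probability of being a case
odds = p / (1 - p)              # odds of being a case, equal to exp(z)

print(p)
print(odds)
```

A score of 0 gives a probability of 0.5, i.e. even odds.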

In [None]:
# convert categorical features using pandas' get_dummies
# eg: turn Female & Male into 0 & 1 so that the ML algorithm can use the categorical feature
# drop_first=True drops the 1st dummy column, since the remaining column(s) already encode it
# eg: male column, 0 = female, 1 = male.
cat_feat = pd.get_dummies(df['*column_name_to_transform*'], drop_first=True)

# concatenate the dummy column(s) with the original data
df = pd.concat([df, cat_feat], axis=1)

In [None]:
# NOMINATE X AND Y
X = df[['*feature_column_1*', '*feature_column_2*']] # only numerical feature columns
y = df['*y_column_name*'] # data you want to predict

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [None]:
# import logistic regression model and create it
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()

In [None]:
# TRAIN AND TEST THE MODEL
# train the model 
logmodel.fit(X_train, y_train)

# predict the testing sets with the model
predict = logmodel.predict(X_test)

In [None]:
# MODEL RESULTS: EVALUATE THE MODEL
# using classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# show classification report
print(classification_report(y_test, predict))

# show confusion matrix
print(confusion_matrix(y_test, predict))

Confusion Matrix: <br>
TN, FP <br>
FN, TP <br>
TN = true negative, predict = no, actual = no <br>
FP = false positive, predict = yes, actual = no <br>
FN = false negative, predict = no, actual = yes <br>
TP = true positive, predict = yes, actual = yes <br>
<br>
You want TN & TP to be high and FP & FN to be low. 

<br>
In the classification report, you want precision and recall to be high (close to 1)
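
As a rough illustration of how the four cells turn into the numbers in the report, the sketch below counts TN, FP, FN and TP by hand for made-up labels (the `actual` and `predicted` lists are hypothetical) and derives precision, recall and accuracy for the "yes" class.

```python
# Hypothetical labels: 1 = yes, 0 = no
actual    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)
tp = sum(a == 1 and p == 1 for a, p in pairs)

precision = tp / (tp + fp)        # of everything predicted yes, how much was really yes
recall = tp / (tp + fn)           # of everything really yes, how much was found
accuracy = (tp + tn) / len(pairs)

print([[tn, fp], [fn, tp]])
print(precision, recall, accuracy)
```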
   

# K Nearest Neighbours
- Supervised technique
- Used for classification or regression on labelled data, i.e. the target variable is known beforehand for the training examples
- K = number of nearest neighbours used to classify. 
eg: you want to put X into group A or B with k = 3; X's 3 nearest neighbours are 2 in group A and 1 in group B, so X is assigned to group A. 

Note: Because a KNN classifier predicts the class of a test observation by identifying the observations nearest to it, the scale of the variables matters a lot: any variable on a large scale has a much larger effect on the distance between observations. So when you use KNN, standardize all features to the same scale. 
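
A small sketch of why scale matters, using made-up numbers: the age and income columns and the helper `nearest_to_first` are hypothetical, and the standardization is done by hand with NumPy (subtract the column mean, divide by the column standard deviation, the same idea as `StandardScaler`).

```python
import numpy as np

# Hypothetical features: column 0 = age in years, column 1 = income in dollars
X_raw = np.array([[25.0, 50_000.0],   # the point we look for neighbours of
                  [60.0, 50_500.0],   # very different age, similar income
                  [26.0, 52_000.0],   # similar age, slightly different income
                  [45.0, 90_000.0]])

def nearest_to_first(X):
    """Return the row index of the point closest (Euclidean) to row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=0 + 1)
    return int(np.argmin(distances)) + 1

# Standardize each column to mean 0 and standard deviation 1
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(nearest_to_first(X_raw))  # income dominates the raw distance
print(nearest_to_first(X_std))  # after scaling, age counts as well
```

On the raw data the 60-year-old is the nearest neighbour only because the income gap is numerically small; after scaling, the 26-year-old becomes the nearest.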

In [None]:
# SCALE THE FEATURED DATA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit the scaler to the data (computes each column's mean and standard deviation, used later by transform)
# remember to drop the column you want to predict
df_notc = df.drop('*y_column_name*', axis=1)
scaler.fit(df_notc)

# use the scaler to do the transformation
scaled_feature = scaler.transform(df_notc)

# create a table for the scaled features
df_feat = pd.DataFrame(scaled_feature, columns=df_notc.columns)

In [None]:
# NOMINATE X AND Y
X = df_feat #standardized feature data
y = df['*y_column_name*'] #data you want to predict

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [None]:
# import KNN model and create it
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

n_neighbors = 1 is k=1 <br>
Given a new example, the model looks through the training data and finds the k training examples closest to it; k is thus the number of neighbors considered. <br>
eg: k=2, the 2 closest neighbors are used to smooth the estimate at a given point
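
The "find the k closest examples and let them vote" idea can be sketched in plain Python. This is only an illustration, not how scikit-learn implements it; the `train` points, the group names and the helper `knn_predict` are made up.

```python
from collections import Counter
import math

# Hypothetical labelled training points: ((x, y), group)
train = [((1.0, 1.0), 'A'), ((1.5, 2.0), 'A'), ((3.0, 3.0), 'B'),
         ((5.0, 5.0), 'B'), ((6.0, 5.5), 'B')]

def knn_predict(point, train, k):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

new_point = (2.0, 2.0)
print(knn_predict(new_point, train, k=3))  # neighbours: 2 in A, 1 in B
print(knn_predict(new_point, train, k=5))  # all 5 points vote: 2 in A, 3 in B
```

The same point can end up in a different group depending on k, which is why picking k deserves some care.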

In [None]:
# TRAIN AND TEST THE MODEL
# train the model 
knn.fit(X_train, y_train)

# predict the testing sets with the model
predict = knn.predict(X_test)

In [None]:
# MODEL RESULTS: EVALUATE THE MODEL
# using classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# show classification report
print(classification_report(y_test, predict))

# show confusion matrix
print(confusion_matrix(y_test, predict))

This shows the results with k=1. If the results aren't good, or if you want to improve them, use the elbow method (a for loop over k) to pick the best k value

In [None]:
# Elbow method
error_rate = []

for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    
    # append the error rate into the list
    error_rate.append(np.mean(pred_i != y_test))


In [None]:
# plot the error rate
plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate,color='blue', linestyle='--',marker='o',markersize=10,markerfacecolor='red')
plt.title('Error Rate vs K Value')
plt.xlabel('k')
plt.ylabel('Error Rate')

Once you plot it, check which k value gives the lowest error rate. Pick a k whose neighbouring values (just before and after it) also produce a consistently low error rate. 
<br>
Once you've picked the k value, re-run the model to check the precision and the confusion matrix and see whether you're getting a better result

# Decision Trees and Random Forest

- both are supervised techniques
- Used for classification and regression
- Decision tree = behaves like a chain of "if this then that" conditions that ultimately yields a specific result.
- Decision trees are easy to interpret, can handle both numerical and categorical data, and perform well on large datasets, but are prone to overfitting.
- Random Forest = a collection (ensemble) of decision trees. For each tree, a fraction of the rows and a particular number of features are selected at random, and a decision tree is built on that subset. <br>
ie: (a collection of trees = forest, with each tree trained on a randomly selected subset)


Difference between decision trees and logistic regression
- logistic regression: searches for a single linear decision boundary in the feature space. You add interaction terms manually.
- decision trees: partition the feature space into half-spaces using axis-aligned linear decision boundaries. The net effect is a non-linear decision boundary, possibly more than one. They automatically take interactions between variables into account.
<p>

It's always good to try both models and do cross-validation
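
One possible way to "try both models and do cross-validation" is scikit-learn's `cross_val_score`. The sketch below uses a synthetic dataset from `make_classification`, so the data, the `_demo` names and the parameter choices (5 folds, `max_iter=1000`) are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 300 rows, 5 features of which 3 carry signal
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=101)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=101),
}

# Mean accuracy over 5 folds for each model
cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
             for name, model in models.items()}

for name, score in cv_scores.items():
    print(name, round(score, 3))
```

Which model wins depends on the data, so the scores are only meaningful on your own dataset.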

In [None]:
# NOMINATE X AND Y
X = df[['*feature_column_1*', '*feature_column_2*']] # only numerical feature columns
y = df['*y_column_name*'] # data you want to predict

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=101)

In [None]:
# train with 1 decision tree
from sklearn.tree import DecisionTreeClassifier

# train with Random Forest
from sklearn.ensemble import RandomForestClassifier

# decision tree
dtree = DecisionTreeClassifier()

# random forest, you can experiment with the number of estimators (trees), 200 is alright
rfc = RandomForestClassifier(n_estimators=200)

In [None]:
# train the data with decision tree/Random Forest
dtree.fit(X_train, y_train)
rfc.fit(X_train, y_train)

In [None]:
# predict the test data with the model
# run one or the other: the second assignment overwrites the first
prediction = dtree.predict(X_test) # --OR
prediction = rfc.predict(X_test)

In [None]:
# MODEL RESULTS: EVALUATE THE MODEL
# using classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# show classification report
print(classification_report(y_test, prediction))

# show confusion matrix
print(confusion_matrix(y_test, prediction))

# Support Vector Machine
- Supervised learning
- Used for classification or regression problems.
- Uses a technique called the kernel trick to transform the data, then finds an optimal boundary between the possible outputs based on these transformations (i.e. it does some complex data transformation, then figures out how to separate the data based on the labels or outputs you've defined).


Random Forest & SVM
- SVM tends to perform better on sparse data.
- Random Forest is well suited to multiclass problems.

<p>
But it's always good to test out all similar models: SVM, Decision Trees/Random Forest, logistic regression.

In [None]:
# NOMINATE X AND Y
X = df[['*feature_column_1*', '*feature_column_2*']] # only numerical feature columns
y = df['*y_column_name*'] # data you want to predict

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)

In [None]:
# train with svm
from sklearn.svm import SVC

model = SVC()

In [None]:
# fit the model
model.fit(X_train, y_train)

In [None]:
# predict the test data using the model
prediction = model.predict(X_test)

Check whether the model's results are good enough. If not, you can use grid search to find the best parameters to train the model.

Grid search parameters:
- C: controls the cost of misclassification on the training data. Large C values = low bias & high variance, because you penalize misclassification heavily. Small C values = high bias & low variance, because you don't penalize that cost as much.
- Gamma: the coefficient of the Gaussian radial basis function (RBF) kernel, which is what kernel='rbf' uses. Small gamma = wide Gaussian (large variance), so each support vector influences a wide region and the boundary is smoother (higher bias, lower variance in the model). Large gamma = narrow Gaussian (small variance), which leads to lower bias & higher variance in the model.
<br>
if gamma = large, the Gaussian's variance = small --> the influence of each support vector is not widespread 
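
To get a feel for gamma, the sketch below evaluates the RBF kernel `exp(-gamma * ||x - z||^2)` by hand for two made-up points. The points and the helper `rbf_kernel_value` are hypothetical; the formula is the standard RBF kernel that `kernel='rbf'` refers to.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """Similarity between x and z under the RBF kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])  # squared distance to x is 2

small_gamma = rbf_kernel_value(x, z, gamma=0.01)  # wide Gaussian: z still looks similar to x
large_gamma = rbf_kernel_value(x, z, gamma=10)    # narrow Gaussian: z looks unrelated to x

print(small_gamma, large_gamma)
```

With a small gamma, a support vector still "sees" points far away; with a large gamma, its influence drops to almost nothing a short distance away.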

In [None]:
# GRID SEARCH
# find the correct parameters to train a model in svm
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in version 0.20

In [None]:
# a dictionary where the keys are the actual parameters that go into the model you're using (SVC)
param_grid = {'C':[0.1,1,10,100,1000], 'gamma':[1,0.1,0.01,0.001,0.0001]}

In [None]:
# create a grid search
# verbose = the higher the number, the more verbose (text output of the description of the process)
grid = GridSearchCV(SVC(), param_grid, verbose=3 )

*** NOTE: grid search can take a long time, depending on how large the data is 

In [None]:
# fit the model
grid.fit(X_train, y_train)

In [None]:
# get the best parameters of the grid
grid.best_params_

In [None]:
# get the best estimator for the SVC
grid.best_estimator_

In [None]:
# predict the test data with the best parameters model
grid_predictions = grid.predict(X_test)

In [None]:
# MODEL RESULTS: EVALUATE THE MODEL
# using classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

# show classification report
print(classification_report(y_test, grid_predictions))

# show confusion matrix
print(confusion_matrix(y_test, grid_predictions))

A model tuned with grid search usually performs better (higher accuracy) than one trained without it. The larger the data, the more visible the difference tends to be.

# K-Means Clusters
- Unsupervised learning
- Used on unlabeled data (i.e., data without defined categories or groups)
- Solves clustering problems by partitioning a given data set into a number of clusters (k)

In [None]:
# use K-means cluster
from sklearn.cluster import KMeans

In [None]:
# nominate number of clusters
kmeans = KMeans(n_clusters=4)

*** NOTE: There's no easy answer for choosing the best k value, but one way to try is the __Elbow method__. 
<p>

We used a similar method in __K Nearest Neighbors__

<p>

1. Compute the sum of squared errors (SSE) for some values of k (eg: 2,4,6,8...). The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid.
2. When you plot k against the SSE, you'll see the error decrease as k gets larger: with more clusters, each cluster is smaller, so the distortion is also smaller.
3. Elbow method = choose the k at which the SSE stops decreasing abruptly (the "elbow" of the curve), i.e. the k beyond which adding more clusters gives little extra information because it no longer significantly decreases the within-group SSE. 
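
The three steps above can be sketched with scikit-learn, whose fitted `KMeans` exposes the SSE as `inertia_`. The data here is synthetic (`make_blobs` with 4 centres), and the range of k values and the `_demo` names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: 300 points generated around 4 centres
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=101)

k_values = range(1, 9)
sse = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=101).fit(X_demo)
    sse.append(km.inertia_)  # sum of squared distances to the closest centroid

for k, value in zip(k_values, sse):
    print(k, round(value, 1))
```

Plotting `k_values` against `sse` (for example with `plt.plot(k_values, sse, marker='o')`) should show the curve flattening once k reaches the number of real groups in the data.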

In [None]:
# fit the model
kmeans.fit(data)

In [None]:
# to see the centroids of the clusters
kmeans.cluster_centers_

In [None]:
# to see the predicted clusters for the data
kmeans.labels_

In [None]:
# to plot and see how the model predicts the clustering
plt.scatter(x1_data, x2_data, c=kmeans.labels_, cmap='rainbow')

# Principal Component Analysis
- unsupervised learning
- statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
- often used to make data easy to explore and visualize
- not used to predict
- mostly tries to find the components that explain the most variance in the data set.
- see here for more info: http://setosa.io/ev/principal-component-analysis/
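
To see what "explains the most variance" means in numbers, a fitted scikit-learn `PCA` exposes `explained_variance_ratio_`. The sketch below uses made-up data: three noisy copies of one hidden factor plus one unrelated column, so most of the variance should end up in the first component. The data and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(101)
hidden = rng.normal(size=(200, 1))

# Hypothetical data: 3 strongly correlated columns + 1 independent column
correlated = [hidden + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
X_demo = np.hstack(correlated + [rng.normal(size=(200, 1))])

scaled_demo = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(scaled_demo)

ratios = pca_demo.explained_variance_ratio_  # share of total variance per component
print(ratios, ratios.sum())
```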


Here we find the first 2 principal components and use them to visualize the data
<p>
We need to scale the data so that each feature has unit variance before we can use PCA on it

In [None]:
# data in dataframe
df

In [None]:
# SCALE THE DATA
# import standard scaler from scikit-learn
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
# fit the model to the dataframe
scaler.fit(df)

In [None]:
# transform it
scaled_data = scaler.transform(df)

In [None]:
# PERFORM PCA
from sklearn.decomposition import PCA

# nominate the number of components you want to keep
pca = PCA(n_components=2)

In [None]:
# fit the model to the data
pca.fit(scaled_data)

In [None]:
# transform
x_pca = pca.transform(scaled_data)

In [None]:
# to check the rows & columns after running the pca
x_pca.shape

You should get just 2 principal components: we have transformed 10+ dimensions (features) into just 2 

In [None]:
# VISUALIZATION
# to visualize 1st and 2nd principal component, c = color
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0], x_pca[:,1], c= target_data, cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

With the first 2 principal components plotted, the 2 classes can be separated easily 

In [None]:
# to visualize what principal component represents, we can use heatmap
df_comp = pd.DataFrame(pca.components_, columns=featured_column_name)

# heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df_comp, cmap='plasma')

The heatmap shows how each principal component (row) relates to each of the original features (column).
<p>
The brighter the color (towards yellow in plasma), the more strongly the feature in that column is correlated with the component

# Natural Language Processing (NLP) 

- NLP = application of computational techniques to the analysis and synthesis of natural language and speech
- helps computers understand human language (it breaks down and processes language)


Before vectorization (turning each message into a vector that machine learning models can understand), we need to do text pre-processing: remove punctuation & stop words 

In [None]:
# TEXT PRE-PROCESSING
# Need to remove punctuation (.,@#$%^&; etc) & stop words (I, you, we, she, he, himself, herself, them, themselves etc)
import string
import nltk
from nltk.corpus import stopwords  # run nltk.download('stopwords') once if the corpus is missing

# data in dataFrame: col: label=spam or ham(normal message), message=sms messages content
messages

In [None]:
# function to remove punctuation and stop words for all messages
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    
    # check characters to see if they're in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    
    # join the characters again to form the string
    nopunc = ''.join(nopunc)
    
    # remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

In [None]:
# apply the function
messages['message'].apply(text_process)

When text pre-processing is done, we need to convert the list of words into an actual vector that scikit-learn can use. The process is called vectorization

There are 3 steps: <br>
1. Count how many times a word occurs in each message (term frequency)
2. Weight the counts, so that frequent tokens get lower weight (inverse document frequency)
3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
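
The three steps can be sketched on a tiny made-up corpus. The `docs` list is hypothetical, and the default tokenizer is used here instead of a custom analyzer; `TfidfTransformer` applies the inverse document frequency weighting and, by default, the L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini corpus of 3 messages
docs = ['free prize now', 'call me now', 'free free call']

# Step 1: term counts per message
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Steps 2 and 3: IDF weighting, then L2 normalization (norm='l2' is the default)
tfidf = TfidfTransformer().fit_transform(counts)

# Each row of the TF-IDF matrix should now have length 1
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()
free_col = vectorizer.vocabulary_['free']

print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
print(row_norms)
```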
    
<p>
 Once the vectorization process is done, i.e. the text messages have been converted to numerical vectors, we can train the model to predict the label.

<p>
 Fortunately, a scikit-learn pipeline can chain all these steps, from vectorization to training the model 

In [None]:
# VECTORIZATION & TRAINING
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# use Naive Bayes to train the model
from sklearn.naive_bayes import MultinomialNB

In [None]:
# split the data into training set and testing set of the model
from sklearn.model_selection import train_test_split

msg_train, msg_test, label_train, label_test = train_test_split(messages['message'],messages['label'], test_size=0.3)

In [None]:
# Creating a data pipeline
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

In [None]:
# fit the model
pipeline.fit(msg_train, label_train)

In [None]:
# predict the test data with the model
predictions = pipeline.predict(msg_test)

In [None]:
# MODEL RESULTS: EVALUATE THE MODEL
# using classification report
from sklearn.metrics import classification_report

# show classification report
print(classification_report(label_test, predictions))


In [None]:
# ** To use different model from Naive Bayes, eg: Random Forest
from sklearn.ensemble import RandomForestClassifier

# pipeline
pipelineRF = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)), 
    ('tfidf', TfidfTransformer()), 
    ('classifier', RandomForestClassifier()),  
])

# train the data with Random Forest
pipelineRF.fit(msg_train, label_train)

# predict the test data with the model
predictionsRF = pipelineRF.predict(msg_test)

# show result
print(classification_report(label_test, predictionsRF))