<a name='home'></a>
# Machine Learning with Python
## Table of Contents
1. [Introduction to ML](#intro)
2. [Linear Regression](#reg)
3. [Non-Linear Regression](#nonlin)
4. [Classification](#class)  
    4.1 [K-Nearest Neighbours](#knear)  
    4.2 [Evaluation Metrics in Classification](#metric)  
    4.3 [Decision Trees](#tree)  
    4.4 [Logistic Regression](#log)  
    4.5 [Support Vector Machine](#svm)  
5. [Clustering](#clust)  
    5.1 [k-Means Clustering](#kclust)  
    5.2 [Hierarchical Clustering](#hclust)  
    5.3 [Density based Clustering](#dclust)
6. [Content-based Recommendation Engines](#cbre)  
7. [Collaborative Filtering](#colfil)  
10. [LabExercise](#lab)  
    10.1 [Linear Regression](#lab_lin)  
    10.2 [Multiple Linear Regression](#lab_mult)  
    10.3 [Polynomial Regression](#lab_pol)  
    10.4 [Non-linear Regression](#lab_nonlin)  
    10.5 [K-Nearest Neighbours](#lab_knear)  
    10.6 [Decision Trees](#lab_tree)  
    10.7 [Logistic Regression](#lab_log)  
    10.8 [Support Vector Machine](#lab_svm)  
    10.9 [k-Means Clustering](#lab_kclu)  
    10.10 [Hierarchical Clustering](#lab_hclust)    
    10.11 [Density-based Clustering](#lab_dclust)  
    10.12 [Content-based Recommender Engines](#lab_cbre)  
    10.13 [Collaborative Filtering](#lab_colfil)


<a name='intro'></a>
## Introduction to Machine Learning (ML)
Machine learning is the subfield of computer science that gives "computers the ability to learn without being explicitly programmed."

Machine learning allows us to build a model that looks at all the feature sets of a group of examples (for instance, animals) and their corresponding types, and learns the pattern of each type. The model is built by machine learning algorithms and detects the type of a new case without being explicitly programmed to do so. Machine learning algorithms, inspired by the human learning process, iteratively learn from data and allow computers to find hidden insights. These models help us in a variety of tasks, such as object recognition, summarization, recommendation, and so on. 

Some of the more popular techniques are:
* **Regression/Estimation** is used for predicting a continuous value, for example, predicting the price of a house based on its characteristics, or estimating the CO2 emission from a car’s engine. 
* **Classification** is used for predicting the class or category of a case, for example, whether a cell is benign or malignant, or whether or not a customer will churn. 
* **Clustering** groups similar cases, for example, to find similar patients, or for customer segmentation in the banking field. 
* **Association** is used for finding items or events that often co-occur, for example, grocery items that are usually bought together by a particular customer. 
* **Anomaly detection** is used to discover abnormal and unusual cases, for example, in credit card fraud detection.
* **Sequence mining** is used for predicting the next event, for instance, the click-stream in websites.
* **Dimension reduction** is used to reduce the size of data.
* **Recommendation systems** associate people's preferences with others who have similar tastes, and recommend new items to them, such as books or movies.

Some of these techniques are covered in the following sections.

What is the difference between the buzzwords that we keep hearing these days, such as Artificial Intelligence (AI), Machine Learning, and Deep Learning? In brief: 
* AI tries to make computers intelligent in order to mimic the cognitive functions of humans. So, Artificial Intelligence is a general field with a broad scope including: 
    * Computer Vision, 
    * Language Processing, 
    * Creativity, 
    * Summarization. 
* Machine Learning is the branch of AI that covers the **statistical part** of artificial intelligence. It teaches the computer to solve problems by looking at hundreds or thousands of examples, learning from them, and then using that experience to solve the same problem in new situations. 
* Deep Learning is a very special field of Machine Learning where computers can actually learn and make intelligent decisions on their own. Deep learning involves a deeper level of automation in comparison with most machine learning algorithms.

## Python for Machine Learning
You can write your machine learning algorithms from scratch in Python, and it works very well. However, many modules and libraries are already implemented in Python and can make your life much easier.
1. NumPy is a math library for working with N-dimensional arrays in Python. It enables you to do computation efficiently and effectively, and for numerical work it is more convenient than plain Python lists. You need to know NumPy for working with arrays, datatypes, and images, for example. 
2. SciPy is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more. SciPy is a good library for scientific and high-performance computation. 
3. Matplotlib is a very popular plotting package that provides 2D plotting as well as 3D plotting.
4. Pandas is a very high-level Python library that provides high-performance, easy-to-use data structures. It has many functions for data importing, manipulation, and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 
5. scikit-learn is a free machine learning library for the Python programming language. It is a collection of algorithms and tools for machine learning, and it is the library used most in the labs.

## Supervised versus Unsupervised
**Supervised**: to supervise means to observe and direct the execution of a task, project, or activity. Here we are not supervising a person; instead, we are supervising a machine learning model that might be able to produce classification regions.
We do this by teaching the model, that is, we load the model with knowledge so that it can predict future instances. How exactly do we teach a model? We train it with data from a labeled dataset. 

**Unsupervised** means we let the model work on its own to discover information that may not be visible to the human eye. The unsupervised algorithm trains on the dataset and draws conclusions from unlabeled data. Generally speaking, unsupervised learning uses more difficult algorithms than supervised learning, since we know little to nothing about the data or the outcomes that are to be expected. 

The most widely used unsupervised machine learning techniques are:
* Dimension reduction, 
* Density estimation, 
* Market basket analysis, 
* Clustering.

|Supervised|Unsupervised|
|---|---|
|Classification: Classifies labeled data|Clustering: Finds pattern and groupings from unlabeled data|
|Regression: Predicts trends using previous labeled data||
|Has more evaluation methods than unsupervised learning|Has fewer evaluation methods than supervised learning|
|Controlled environment|Less controlled environment|

[Home](#home)

<a name='reg'></a>
## Linear Regression
Regression is the process of predicting a continuous value. In regression, there are two types of variables: a dependent variable and one or more independent variables. The dependent variable can be seen as the state, target, or final goal we study and try to predict. The independent variables, also known as explanatory variables, can be seen as the causes of those states. The independent variables are conventionally denoted by X, and the dependent variable by Y. A regression model relates Y, the dependent variable, to a function of X, the independent variables. The key point in regression is that the dependent value should be continuous and cannot be a discrete value.  

Basically, there are two types of regression models: simple regression and multiple regression. Simple regression is when one independent variable is used to estimate a dependent variable. It can be either linear or non-linear.  

When more than one independent variable is present, the process is called multiple linear regression.

### Simple Linear Regression
Linear regression is the approximation of a linear model used to describe the relationship between two or more variables. In simple linear regression, there are two variables, a dependent variable and an independent variable. The key point in the linear regression is that our dependent value should be continuous and cannot be a discrete value.

* Simple linear regression is when one independent variable is used to estimate a dependent variable.
* When more than one independent variable is present, the process is called multiple linear regression.

With linear regression you can fit a line through the data. For instance, as the engine size increases, so do the emissions. With linear regression you can model the relationship of these variables. A good model can be used to predict what the approximate emission of each car is.

The fitted line is traditionally written as a polynomial. In a simple regression problem with a single independent variable, it has the form  

$ \hat{y} = \theta_0 + \theta_1 x_1$  

In this equation, $\hat{y}$ is the dependent variable or the predicted value. And $x_1$ is the independent variable. $\theta_0$ and $\theta_1$ are the parameters of the line that we must adjust. 
* $\theta_1$ is known as the slope or gradient of the fitting line
* $\theta_0$ is known as the intercept.

$\theta_0$ and $\theta_1$ are also called the coefficients of the linear equation.  

If you compare the actual emission of a car with the value predicted by the model, you will find that there is an error. This means the prediction line is not perfectly accurate. This error is also called the residual error. The mean of all residual errors shows how poorly the line fits the whole dataset. Mathematically, it is expressed as the mean squared error (MSE):  

$MSE = \frac{1}{n}\sum_{i=1}^n{(y_i-\hat{y_i})^2}$  

The objective is to find a line where the mean of all these errors is minimized. In other words, the mean error of the prediction using the fitted line should be minimized. We have two options here:
1. use a mathematical approach 
2. use an optimization approach.

For option 1, the coefficients can be calculated directly from the $n$ data points, where $\overline{x}$ and $\overline{y}$ are the mean values of $x$ and $y$:  
$ \theta_1 = \frac{\sum_{i=1}^{n}{(x_i-\overline{x})(y_i-\overline{y})}}{\sum_{i=1}^{n}{(x_i-\overline{x})^2}}$  

$ \theta_0 = \overline{y} - \theta_1 \overline{x}$  
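
The two formulas for option 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `fit_simple_linear_regression` and the engine size and emission numbers are made up for this example.

```python
def fit_simple_linear_regression(x, y):
    """Return (theta_0, theta_1) for the line y_hat = theta_0 + theta_1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of co-deviations divided by the sum of squared deviations of x.
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = sum((xi - x_mean) ** 2 for xi in x)
    theta_1 = numerator / denominator
    # Intercept: the fitted line passes through the point (x_mean, y_mean).
    theta_0 = y_mean - theta_1 * x_mean
    return theta_0, theta_1


# Hypothetical data, invented for illustration only.
engine_size = [1.0, 2.0, 3.0, 4.0, 5.0]
co2_emission = [130.0, 160.0, 190.0, 220.0, 250.0]

theta_0, theta_1 = fit_simple_linear_regression(engine_size, co2_emission)
predicted = theta_0 + theta_1 * 2.5  # prediction for an unseen engine size
print(theta_0, theta_1, predicted)
```

Because the made-up data lies exactly on a line, the fit recovers that line and the residual errors are zero; with real data the residuals would not vanish.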

### Model evaluation in Regression Models
The goal of regression is to build a model to accurately predict an unknown case. To this end, we have to perform regression evaluation after building the model.  
How can we calculate the accuracy of our model? In other words, how much can we trust this model for prediction of an unknown sample using a given dataset and having built a model such as linear regression? 

#### Train and test on the same dataset
Assume, for instance, that we have 10 records in our dataset. We use the entire dataset for training and build a model using this training set. We then select a small portion of the same dataset as a test set. The actual values of the test set are not used for prediction; they serve only as ground truth. We pass the feature set of the testing portion to the built model and predict the target values. Finally, we compare the values predicted by the model with the actual values in the test set. There are different metrics to report the accuracy of the model (see the evaluation metrics section below).  

This evaluation approach would most likely have a **high training accuracy** and a **low out-of-sample accuracy**, since the model has already seen all of the testing data points during training.

**Training accuracy** is the percentage of correct predictions that the model makes when it is tested on data it was trained on. However, a high training accuracy is not necessarily a good thing: it may be the result of over-fitting the data. This means that the model is overly fitted to the dataset, so it may capture noise and produce a non-generalized model. 

**Out-of-sample accuracy** is the percentage of correct predictions that the model makes on data that the model has not been trained on.  

Training and testing on the same dataset will most likely give a low out-of-sample accuracy due to the likelihood of over-fitting. It's important that our models have a high out-of-sample accuracy because the purpose of a model is, of course, to make correct predictions on unknown data.

#### Train/test split
The second evaluation approach is called train/test split. In this approach, we select a portion of our dataset for training, and the rest is used for testing. The model is built on the training set. Then, the test feature set is passed to the model for prediction. Finally, the predicted values for the test set are compared with the actual values of the testing set.

Train/test split involves splitting the dataset into training and testing sets, which are **mutually exclusive**. You then train with the training set and test with the testing set. Because the testing set is not part of the data used for training, this is more realistic for real-world problems. We still know the outcome of each data point in the testing set, so the predictions can be checked against it.
The issue with train/test split is that the result is highly dependent on which data points end up in the training set and which in the testing set. Train/test split gives a better out-of-sample estimate than training and testing on the same dataset, but it still has some problems due to this dependency.
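
A mutually exclusive split can be sketched as follows. The helper name, the 80/20 ratio, the seed, and the toy rows are assumptions made for this example; in the labs, scikit-learn's `sklearn.model_selection.train_test_split` serves the same purpose.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train_rows, test_rows) with no overlap."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of (feature, target) pairs, invented for illustration.
rows = [(float(i), 2.0 * i + 1.0) for i in range(10)]
train_rows, test_rows = train_test_split(rows, test_fraction=0.2, seed=42)
print(len(train_rows), len(test_rows))
```

Changing `seed` changes which rows land in the test set, which is exactly the dependency described above.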

#### K-fold Cross validation
K-fold cross-validation resolves most of the above issues. It reduces the high variation that results from this dependency by averaging over several different splits.

If we have K = 4 folds, the dataset is split up and used as follows: 
1. In the first fold for example, we use the first 25 percent of the dataset for testing and the rest for training. The model is built using the training set and is evaluated using the test set. 
2. Then, in the next round or in the second fold, the second 25 percent of the dataset is used for testing and the rest for training the model. Again, the accuracy of the model is calculated. 
3. We continue for all folds. 
4. Finally, the result of all four evaluations are averaged.
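
The four steps can be sketched in plain Python. Everything named here is an assumption made for illustration: the helpers `k_fold_indices` and `cross_validate`, the deliberately trivial "model" that always predicts the training mean, and the toy data. The score used is the mean absolute error rather than an accuracy, because the example is a regression problem.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(x, y, k, fit, score):
    """Fit on each training part, score on the held-out fold, average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(x), k):
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [x[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores)


def fit_mean(x_train, y_train):
    # Trivial stand-in model: always predict the mean of the training targets.
    return sum(y_train) / len(y_train)


def mean_absolute_error(model, x_test, y_test):
    return sum(abs(yi - model) for yi in y_test) / len(y_test)


# Hypothetical data, invented for illustration.
x = list(range(8))
y = [0.0, 0.0, 4.0, 4.0, 0.0, 0.0, 4.0, 4.0]
average_error = cross_validate(x, y, 4, fit_mean, mean_absolute_error)
print(average_error)
```

Each data point is used for testing exactly once, and no fold is tested with a model that was trained on it.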

### Evaluation Metrics in Regression Models
Evaluation metrics are used to explain the performance of a model. They play a key role in the development of a model, as they provide insight into areas that require improvement. 
Model evaluation metrics include
* mean absolute error, 
* mean squared error, 
* root mean squared error.
  
Definition of error:
In the context of regression, the error of the model is the difference between the data points and the trend line generated by the algorithm.

$y$ = actual labels or values  
$\hat{y}$ = predicted labels or value

1. **Mean absolute error (MAE)** is the mean of the absolute value of the errors.  
    $MAE = \frac{1}{n}\sum_{j=1}^n{|y_j-\hat{y}_j|}$  
    This is the easiest of the metrics to understand, since it's just the average error. 
2. **Mean squared error (MSE)** is the mean of the squared error.  
    $MSE = \frac{1}{n}\sum_{j=1}^n{(y_j-\hat{y}_j)^2}$  
    It's more popular than mean absolute error because the focus is geared more towards large errors. This is due to the squared term, which increases larger errors much more than smaller ones. 
3. **Root mean squared error (RMSE)** is the square root of the mean squared error.  
    $RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^n{(y_j-\hat{y}_j)^2}}$  
    This is the most popular of the evaluation metrics because root mean squared error is interpretable in the same units as the response vector $y$, making it easy to relate its information. 
4. **Relative absolute error (RAE)**  
    $RAE = \frac{\sum_{j=1}^n{|y_j-\hat{y}_j|}}{\sum_{j=1}^n{|y_j-\overline{y}|}}$  
    where $\overline{y}$ is the mean value of $y$, takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor that always predicts $\overline{y}$. 
5. **Relative squared error (RSE)**  
    $RSE = \frac{\sum_{j=1}^n{(y_j-\hat{y}_j)^2}}{\sum_{j=1}^n{(y_j-\overline{y})^2}}$  
    is very similar to relative absolute error and is widely adopted by the data science community, as it is used for calculating $R^2$. 
6. **R squared ($R^2$)**  
    $R^2 = 1 - RSE$  
    is not an error per se, but it represents how close the data values are to the fitted regression line. The higher the R-squared, the better the model fits your data. 
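
The six metrics can be computed directly from their definitions. The following is a small sketch in plain Python; the function name `regression_metrics` and the sample values are invented for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, RAE, RSE and R^2 from actual and predicted values."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    abs_errors = [abs(a - p) for a, p in zip(y_true, y_pred)]
    sq_errors = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    mae = sum(abs_errors) / n
    mse = sum(sq_errors) / n
    rmse = math.sqrt(mse)
    # The relative errors normalize by the error of always predicting the mean.
    rae = sum(abs_errors) / sum(abs(a - y_mean) for a in y_true)
    rse = sum(sq_errors) / sum((a - y_mean) ** 2 for a in y_true)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RAE": rae, "RSE": rse, "R2": 1 - rse}


# Hypothetical actual and predicted values, invented for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Note that RAE and RSE are undefined when all actual values are identical, because the denominators are then zero; the sketch does not guard against that case.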

### Multiple Linear Regression
When multiple independent variables are present, the process is called multiple linear regression. Multiple linear regression is the extension of the simple linear regression model.
Basically, there are two applications for multiple linear regression. 
* First, it can be used when we would like to identify the strength of the effect that the independent variables have on the dependent variable. 
* Second, it can be used to predict the impact of changes, that is, to understand how the dependent variable changes when we change the independent variables.

Multiple linear regression is a method of predicting a continuous variable. It uses multiple variables called independent variables or predictors that best predict the value of the target variable which is also called the dependent variable. In multiple linear regression, the target value Y, is a linear combination of independent variables X.

Generally, the model is of the form  
$ \hat{y} = \theta_0 + \theta_1 x_1 + ... + \theta_n x_n$  

Mathematically, we can write it in vector form as well, as the dot product of two vectors, the parameter vector and the feature vector:  
$ \hat{y} = \theta^T X$  
$ \theta^T = [\theta_0, \theta_1, \theta_2, \dots, \theta_n]$ $ X = \begin{bmatrix}
1 \\
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}$

In general, for a multi-dimensional space, the model is written as $\theta^T X$, where $\theta$ is a column vector of unknown parameters and $X$ is the vector of features. Because $\theta$ is a vector of coefficients that is multiplied by $X$, it is conventionally written transposed, as $\theta^T$.

The first element of the feature vector is set to one, so that $\theta_0$ becomes the intercept or bias parameter when the feature vector is multiplied by the parameter vector.

The whole idea is to find the best-fit hyperplane for our data. To this end, as in simple linear regression, we estimate the values of the $\theta$ vector that best predict the value of the target field in each row. To achieve this goal, we have to minimize the error of the prediction. The optimized parameters are the ones that lead to a model with the smallest error.

The mean of all squared residual errors shows how poorly the model represents the dataset. It is called the mean squared error, or MSE. The objective of multiple linear regression is to minimize the MSE. To minimize it, we should find the best parameters $\theta$. The most common methods are **ordinary least squares** and an **optimization approach**.

* **Ordinary least squares** estimates the values of the coefficients by minimizing the mean squared error. This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values of $\theta$. The problem with this technique is the time complexity of the matrix operations, which can take a very long time to finish on large datasets. 

* An **optimization algorithm** finds the best parameters by iteratively minimizing the error of the model on your training data. For example, you can use gradient descent, which starts the optimization with random values for each coefficient, then calculates the error and tries to minimize it by adjusting the coefficients over multiple iterations. Gradient descent is a suitable approach if you have a large dataset. 
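
The optimization approach can be sketched with batch gradient descent on the MSE. This is a minimal illustration under stated assumptions: the function names, the learning rate, the iteration count, and the toy data are all invented for this example, and the coefficients start at zero instead of random values so that the run is reproducible.

```python
def predict(theta, features):
    """Compute theta^T x, where theta[0] multiplies the implicit leading 1."""
    return theta[0] + sum(t * f for t, f in zip(theta[1:], features))


def gradient_descent(X, y, learning_rate=0.05, n_iterations=5000):
    """Minimize the MSE of a linear model by batch gradient descent."""
    n_samples, n_features = len(X), len(X[0])
    theta = [0.0] * (n_features + 1)  # assumption: start from zeros
    for _ in range(n_iterations):
        errors = [predict(theta, row) - target for row, target in zip(X, y)]
        # Gradient of the MSE with respect to the intercept and each coefficient.
        gradient = [2.0 / n_samples * sum(errors)]
        for j in range(n_features):
            gradient.append(2.0 / n_samples * sum(e * row[j] for e, row in zip(errors, X)))
        # Move each coefficient a small step against its gradient.
        theta = [t - learning_rate * g for t, g in zip(theta, gradient)]
    return theta


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2, invented for illustration.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0, 9.0]
theta = gradient_descent(X, y)
print(theta)
```

A learning rate that is too large makes the updates diverge, and one that is too small makes convergence slow; the value used here was chosen for this toy data only.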

Once we have found the parameters of the linear equation, making predictions is as simple as solving the equation for a specific set of inputs. Multiple linear regression **estimates the relative importance of predictors**.

Challenges:
1. Adding too many independent variables without any theoretical justification may result in an overfit model. An overfit model is a real problem because it is too complicated for your data set and not general enough to be used for prediction. So, it is recommended to avoid using many variables for prediction.
2. Should independent variables be continuous? Categorical independent variables can be incorporated into a regression model by converting them into numerical variables.
3. There needs to be a linear relationship between the dependent variable and each of your independent variables. If the relationship displayed in your scatter plot is not linear, then you need to use non-linear regression.

[Home](#home)

<a name='nonlin'></a>
## Non-linear Regression
If the data shows a curvy trend, linear regression will not produce very accurate results compared to a non-linear regression, simply because, as the name implies, linear regression presumes that the data is linear. In either case, our job is to estimate the parameters of the model.

Quadratic and cubic regression curves are two examples, and the degree can be increased indefinitely. In essence, we can call all of these polynomial regression, where the relationship between the independent variable X and the dependent variable Y is modeled as an Nth degree polynomial in X. With many types of regression to choose from, there's a good chance that one will fit your dataset well. Polynomial regression fits a curved line to your data.

$ \hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$

Though the relationship between X and Y is non-linear here and polynomial regression can fit it, a polynomial regression model can still be expressed as linear regression: if each power of $x$ is treated as a separate feature, the model is linear in the parameters $\theta$. Polynomial regression is considered to be a **special case of traditional multiple linear regression.** So, you can use the same mechanism as linear regression to solve such a problem. Therefore, polynomial regression models can be fit using the method of least squares.
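
To illustrate why polynomial regression is a special case of multiple linear regression, the sketch below expands each $x$ into the features $[1, x, x^2]$ and then solves an ordinary least squares problem through the normal equations. The helper names, the small Gauss-Jordan solver, and the data are assumptions made for this example; in practice a linear algebra library would do the solving.

```python
def polynomial_features(x, degree):
    """Map a scalar x to the feature list [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]


def solve_linear_system(A, b):
    """Solve A t = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_polynomial(xs, ys, degree):
    """Least squares fit via the normal equations (Z^T Z) theta = Z^T y."""
    Z = [polynomial_features(x, degree) for x in xs]
    k = degree + 1
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    Zty = [sum(z[i] * y for z, y in zip(Z, ys)) for i in range(k)]
    return solve_linear_system(ZtZ, Zty)


# Hypothetical data generated from y = 1 - 2x + 0.5x^2, invented for illustration.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 - 2.0 * x + 0.5 * x ** 2 for x in xs]
theta = fit_polynomial(xs, ys, degree=2)
print(theta)
```

The fitting step never sees that the features are powers of the same variable; it only solves a linear problem in $\theta$.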

1. Non-linear regression is a method to model a non-linear relationship between the dependent variable and a set of independent variables. 
2. For a model to be considered non-linear, $\hat{y}$ must be a non-linear function of the parameters $\theta$, not necessarily of the features $X$.

That is, in non-linear regression, a model is non-linear in its parameters. In contrast to linear regression, we cannot use the ordinary least squares method to fit the data in non-linear regression. In general, estimation of the parameters is not easy.

Challenges:
1. How can I know if a problem is linear or non-linear? 
    * First, visually check whether the relation is linear or non-linear. It's best to plot bivariate plots of the output variable against each input variable. 
    * Calculate the correlation coefficient between the independent and dependent variables. If it is 0.7 or higher for all variables, there is a linear tendency, and it is not appropriate to fit a non-linear regression.
    * Use non-linear regression instead of linear regression when the relationship cannot be modeled accurately with linear parameters.
2. How should I model my data if it displays a non-linear trend on a scatter plot? 
    Use either a polynomial regression, a non-linear regression model, or transform your data.
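
The correlation check from the first challenge can be sketched with the Pearson correlation coefficient, computed here from its definition. The function name and the data are invented for illustration.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    covariation = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    x_spread = math.sqrt(sum((a - x_mean) ** 2 for a in x))
    y_spread = math.sqrt(sum((b - y_mean) ** 2 for b in y))
    return covariation / (x_spread * y_spread)


# Hypothetical data: one linear and one quadratic relationship.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_y = [2.0 * a + 1.0 for a in x]
quadratic_y = [a ** 2 for a in x]

r_linear = pearson_correlation(x, linear_y)
r_quadratic = pearson_correlation(x, quadratic_y)
print(r_linear, r_quadratic)
```

The quadratic data is perfectly predictable from `x`, yet its correlation coefficient is zero because the relationship is not linear. This is one reason the visual check is listed first.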

[Home](#home)

<a name='class'></a>
## Classification
In machine learning, classification is a **supervised learning** approach which can be thought of as a means of categorizing or classifying some unknown items into a discrete set of classes. Classification attempts to learn the relationship between a set of feature variables and a target variable of interest. The target attribute in classification is a categorical variable with discrete values.

Given a set of training data points along with the target labels, classification determines the class label for an unlabeled test case. 

There are many types of classification algorithms in machine learning. They include 
* decision trees, 
* naive bayes,
* linear discriminant analysis, 
* k-nearest neighbor, 
* logistic regression, 
* neural networks, 
* support vector machines

<a name='knear'></a>
### K-Nearest Neighbours
K-Nearest Neighbors is an algorithm for supervised learning in which the model is 'trained' with data points and their corresponding classifications. When a new point is to be predicted, the algorithm takes into account the 'K' nearest points to it to determine its classification. 
![k-Nearest](https://ibm.box.com/shared/static/mgkn92xck0z05v7yjq8pqziukxvc2461.png)


The K-Nearest Neighbors algorithm is a classification algorithm that takes a bunch of labeled points and uses them to learn how to label other points. This algorithm classifies cases based on their similarity to other cases. In K-Nearest Neighbors, data points that are near each other are said to be neighbors. K-Nearest Neighbors is based on this paradigm.
Similar cases with the same class labels are near each other. Thus, the distance between two cases is a measure of their dissimilarity. There are different ways to calculate the similarity or conversely, the distance or dissimilarity of two data points.

Process:  
1. Pick a value for K. 
2. Calculate the distance from the new (held-out) case to each of the cases in the dataset. 
3. Search for the K observations in the training data that are nearest to the measurements of the unknown data point. 
4. Predict the response of the unknown data point using the most popular response value from the K-Nearest Neighbors. 

Concerns:  
1. How to select the correct K ?
2. How to compute the similarity between cases?

The dissimilarity between two cases $x_1$ and $x_2$ with $n$ features can be computed, for example, with the Euclidean distance:  
$ Dis(x_1,x_2) = \sqrt{\sum_{i=1}^{n}{(x_{1i}-x_{2i})^2}}$

The same distance metric works for any number of features. Of course, we have to normalize our feature set to get an accurate dissimilarity measure. There are other dissimilarity measures as well, but the choice is highly dependent on the data type and on the domain in which the classification is done.

A low value of K, such as K = 1, produces a highly complex model that may capture noise in the data, which might result in overfitting. It means the prediction process is not generalized enough to be used for out-of-sample cases. With a very high value of K, the model becomes overly generalized.
The general solution is to reserve a part of your data for testing the accuracy of the model. Then start with K = 1, use the training part for modeling, and calculate the accuracy of prediction using all samples in your test set. Repeat this process while increasing K and see which K is best for your model. 

Nearest neighbors analysis can also be used to compute values for a continuous target. In this situation, the average or median target value of the nearest neighbors is used to obtain the predicted value for the new case.
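
The process described in this section can be sketched in plain Python for the classification case. The function names and the two-cluster toy data are invented for illustration, and the features are assumed to be already normalized.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k):
    """Return the most common label among the k training points nearest to query."""
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled points forming two groups, invented for illustration.
train_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.0), (6.0, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))
print(knn_predict(train_points, train_labels, (6.0, 5.0), k=3))
```

There is no separate training step in this sketch: the labeled points are simply stored, and all the work happens at prediction time. For a continuous target, the vote would be replaced by the average or median of the neighbors' target values.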

<a name='metric'></a>
### Evaluation Metrics in Classification
Evaluation metrics explain the performance of a model. They play a key role in the development of a model, as they provide insight into areas that might require improvement. There are different model evaluation metrics, but we discuss just three of them here: 
* Jaccard index  
    $J_{y,\hat{y}} = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|} = \frac{|y \cap \hat{y}|}{|y|+|\hat{y}|-|y \cap \hat{y}|}$  
    The size of the intersection divided by the size of the union of two label sets.
* F1-score  
    ![Confusion matrix](https://2.bp.blogspot.com/-EvSXDotTOwc/XMfeOGZ-CVI/AAAAAAAAEiE/oePFfvhfOQM11dgRn9FkPxlegCXbgOF4QCLcBGAs/s1600/confusionMatrxiUpdated.jpg)
    This matrix shows the correct and wrong predictions, in comparison with the actual labels. Each confusion matrix row shows the actual (true) labels in the test set, and the columns show the labels predicted by the classifier.
    Precision is a measure of the accuracy, provided that a class label has been predicted. Recall is the true positive rate. They are defined by: 
    * Precision = True Positive / (True Positive + False Positive) 
    * Recall = True Positive / (True Positive + False Negative)  
    * F1-score = 2 * (Precision * Recall) / (Precision + Recall)  

    The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (which represents perfect precision and recall) and its worst at 0. 
* Log Loss  
    Sometimes, the output of a classifier is the probability of a class label, instead of the label. Logarithmic loss (also known as Log loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. We can calculate the log loss for each row using the log loss equation, which measures how far each prediction is, from the actual label. Then, we calculate the average log loss across all rows of the test set. The classifier with lower log loss has better accuracy.  
    
    $LogLoss = -\frac{1}{n}\sum_{i=1}^{n}{\left[y_i \log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\right]}$
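
The three metrics can be sketched for a binary problem as follows. The function names and the labels are invented for illustration. For the Jaccard index, this sketch assumes that the intersection of the two label sets is the number of positions where the prediction matches the actual label, which is one common reading of the formula above; library implementations may define it per class instead.

```python
import math


def jaccard_index(y_true, y_pred):
    """Matching labels divided by the size of the union of both label sets."""
    matches = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return matches / (len(y_true) + len(y_pred) - matches)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def log_loss(y_true, y_prob):
    """Average log loss; y_prob holds predicted probabilities of class 1."""
    total = sum(a * math.log(p) + (1 - a) * math.log(1 - p) for a, p in zip(y_true, y_prob))
    return -total / len(y_true)


# Hypothetical actual labels, predicted labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]

jaccard = jaccard_index(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
loss = log_loss(y_true, y_prob)
print(jaccard, precision, recall, f1, loss)
```

The sketch does not guard against division by zero (for example, when no positive label is predicted) or against probabilities of exactly 0 or 1, for which the logarithm is undefined.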

<a name='tree'></a>
### Decision Trees
Decision trees are built by splitting the training set into distinct nodes, where one node contains all of or most of one category of the data. Decision trees are about testing an attribute and branching the cases based on the result of the test. Each internal node corresponds to a test, each branch corresponds to a result of the test, and each leaf node assigns a case to a class.  
A decision tree can be constructed by considering the attributes one by one. First, choose an attribute from the dataset. Calculate the significance of the attribute in the splitting of the data. Next, split the data based on the value of the best attribute, then go to each branch and repeat the process for the rest of the attributes. After building this tree, you can use it to predict the class of unknown cases.

Decision trees are built using recursive partitioning to classify the data. The predictiveness of a feature is based on the decrease in impurity of the nodes: we're looking for the best feature to decrease the impurity after splitting the data on that feature. A node in the tree is considered pure if 100 percent of the cases in the node fall into a specific category of the target field. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step. The impurity of a node is calculated by the entropy of the data in the node. 
Entropy is the amount of information disorder, or the amount of randomness, in the data. The entropy of a node depends on how much random data is in that node and is calculated for each node. In decision trees, we're looking for trees that have the smallest entropy in their nodes.  
The entropy is used to calculate the homogeneity of the samples in a node. If the samples are completely homogeneous, the entropy is zero, and if the samples are equally divided between two classes, the entropy is one.

Which attribute is the best one to split on? The answer is the one that gives the tree the higher information gain after splitting. So, what is information gain? Information gain is the information that can increase the level of certainty after splitting. It is the entropy of a tree before the split minus the weighted entropy after the split by an attribute. We can think of information gain and entropy as opposites. As entropy or the amount of randomness decreases, the information gain or amount of certainty increases, and vice versa. So, constructing a decision tree is all about finding attributes that return the highest information gain.
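
Entropy and information gain can be sketched as follows, using the usual base-2 entropy $-\sum_c p_c \log_2 p_c$ over the class proportions $p_c$ in a node. The function names and the labels are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of the class labels in one node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent_labels, child_label_groups):
    """Entropy before the split minus the weighted entropy after the split."""
    n = len(parent_labels)
    weighted_entropy = sum(len(group) / n * entropy(group) for group in child_label_groups)
    return entropy(parent_labels) - weighted_entropy


# Hypothetical node with two classes, split in two different ways.
parent = ["A", "A", "B", "B"]
good_split = [["A", "A"], ["B", "B"]]  # both children are pure
bad_split = [["A", "B"], ["A", "B"]]   # both children are as mixed as the parent

print(entropy(parent))
print(information_gain(parent, good_split))
print(information_gain(parent, bad_split))
```

The split that produces pure children has the highest possible gain for this node, while the split that leaves the children as mixed as the parent has a gain of zero.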

<a name='log'></a>
### Logistic Regression
Logistic regression is analogous to linear regression but tries to predict a categorical or discrete target field instead of a numeric one. In linear regression, we might try to predict a continuous value of variables.
In logistic regression, we predict a variable which is binary such as yes/no, true/false, successful or not successful, pregnant/not pregnant, and so on, all of which can be coded as zero or one. In logistic regression, independent variables should be continuous. If categorical, they should be dummy or indicator coded. This means we have to transform them to some continuous value.

When to use logistic regression:
1.  First, when the target field in your data is categorical or specifically is binary. Such as zero/one, yes/no, churn or no churn, positive/negative and so on. 
2. Second, you need the probability of your prediction. For example, if you want to know what the probability is of a customer buying a product. Logistic regression returns a probability score between zero and one for a given sample of data. In fact, logistic regression predicts the probability of that sample and we map the cases to a discrete class based on that probability. 
3. Third, if your data is linearly separable. The decision boundary of logistic regression is a line or a plane or a hyperplane. A classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
4. Fourth, you need to understand the impact of a feature. You can select the best features based on the statistical significance of the logistic regression model coefficients or parameters. That is, after finding the optimum parameters, a feature X with the weight Theta one close to zero has a smaller effect on the prediction than features with large absolute values of Theta one. Indeed, it allows us to understand the impact an independent variable has on the dependent variable while controlling other independent variables.

The goal of logistic regression is to build a model to predict the class of each customer and also the probability of each sample belonging to a class. Ideally, we want to build a model $\hat{y}$ that can estimate the probability that the class of a customer is one, given its features x.

The sigmoid function, also called the logistic function, resembles the step function and is given by the following expression in logistic regression.

$\sigma (\theta^T X) = \frac{1}{1+e^{-\theta^TX}}$  

* if $\sigma (\theta^T X)$ is close to 1, $e^{-\theta^TX}$ is small (close to 0)
* if $\sigma (\theta^T X)$ is close to 0, $e^{-\theta^TX}$ is large

In the **sigmoid equation**, when Theta transpose x gets very big, the e power minus Theta transpose x in the denominator of the fraction becomes almost 0, and the value of the sigmoid function gets closer to 1. If Theta transpose x is very small (a large negative number), the sigmoid function gets closer to 0. The sigmoid function's output is always between 0 and 1, which makes it proper to interpret the results as probabilities. 

In logistic regression, we model the probability that an input, x, belongs to the default class y equals 1, and we can write this formally as  

* P(Y=1|X)
* P(Y=0|X) = 1 - P(Y=1|X)

Our job is to train the model to set its parameter values in such a way that our model is a good estimate of probability of y equals 1 given x. 

* $\sigma (\theta^T X) \rightarrow$ P(Y=1|X)
* $1 - \sigma (\theta^T X) \rightarrow$ P(Y=0|X)
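As a small illustration of the mapping above, the following sketch evaluates the sigmoid for one sample and reads off both class probabilities. The weight values in `theta` are hypothetical and chosen only for this example; the feature values 2 and 5 stand in for a customer's age and income.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.3, -0.1])   # hypothetical weights, not fitted to any data
x = np.array([2.0, 5.0])        # feature vector, e.g. age and income

p_class_1 = sigmoid(theta @ x)  # estimate of P(Y=1|X)
p_class_0 = 1.0 - p_class_1     # estimate of P(Y=0|X)

print("P(Y=1|X) =", p_class_1)
print("P(Y=0|X) =", p_class_0)
print("large positive input:", sigmoid(50.0))
print("large negative input:", sigmoid(-50.0))
```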

How to build the training process:  
1. **Initialize $\theta$** vector with random values as with most machine learning algorithms. 
2. Calculate the model output $\hat{y} = \sigma (\theta^T X)$:  
    Take, for example, a customer in your training set. X in $\theta^T X$ is the feature vector, for example the age and income of the customer, say 2 and 5, and Theta is the confidence or weight that you've set in the previous step. The output of this equation is the prediction value, in other words, the probability that the customer belongs to class 1. 
3. Compare the output of our model $\hat{y}$ with the actual label of the customer:  
    The output could be a value of, let's say, 0.7, and the actual label could be, for example, 1, for churn. Then, record the difference as our model's error for this customer, which would be 1 minus 0.7, which of course, equals 0.3. This is the error for only one customer out of all the customers in the training set. 
4. Calculate the error for all customers as we did in the previous steps and add up these errors. The total error is the cost of your model and is calculated by the model's cost function.
5. Minimise the cost by changing $\theta$
6. Start another iteration and calculate the cost of the model again. We keep doing those steps over and over, changing the values of Theta each time until the cost is low enough.

The main objective of training in logistic regression is to change the parameters of the model, so as to be the best estimation of the labels of the samples in the dataset. By using the derivative of the cost function, we can find how to change the parameters to reduce the cost or rather the error.  
The cost function is the difference between the actual values of y and our model output y hat. This is a general rule for most cost functions in machine learning. We can show this as the cost of our model comparing it with actual labels, which is the difference between the predicted value of our model and actual value of the target field, where the predicted value of our model is sigmoid of theta transpose x.  

$Cost(\hat{y}, y) = \frac{1}{2}(\sigma (\theta^T X) - y)^2$  
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}{Cost(\hat{y}^{i}, y^{i})}$  

Calculate the minimum point of this cost function and it will show us the best parameters for our model. However, the minimum of this cost function is difficult to find for logistic regression, so we should find another cost function which has the same behavior but whose minimum point is easier to find.  

The minus log function provides such a cost function. It means that
* if the actual value is $y = 1$ and the model also predicts $\hat{y} = 1$, the minus log function $-\log(\hat{y})$ returns zero cost, and the cost grows as $\hat{y}$ moves towards 0.  
![Minus log-function](https://ljvmiranda921.github.io/assets/png/cs231n-ann/neg_log.png)  

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}{\left[y^{i}\log(\hat{y}^{i}) + (1-y^{i})\log(1-\hat{y}^{i})\right]}$  

Use this function to find the parameters of our model in such a way as to minimize the cost by using an optimization approach. There are different optimization approaches, but we use one of the most famous and effective approaches here, **gradient descent**. Gradient descent is an iterative approach to finding the minimum of a function. Specifically in our case gradient descent is a technique to use the derivative of a cost function to change the parameter values to minimize the cost or error.

The gradient is the slope of the surface at every point and the direction of the gradient is the direction of the greatest uphill. The gradient value also indicates how big of a step to take. If the slope is large we should take a large step because we are far from the minimum. If the slope is small we should take a smaller step. Gradient descent takes increasingly smaller steps towards the minimum with each iteration.

A vector of all these slopes is the gradient vector, and we can use this vector to change or update all the parameters. We take the previous values of the parameters and subtract the error derivative. 
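The training steps and the gradient descent update can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the toy data, the column of ones used as an intercept, and the `learning_rate` that scales each step are all choices made for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average minus-log cost over all samples."""
    y_hat = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def fit(X, y, learning_rate=0.1, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=X.shape[1])  # random initial parameters
    for _ in range(n_iter):
        y_hat = sigmoid(X @ theta)                   # model output for every sample
        gradient = X.T @ (y_hat - y) / len(y)        # slope of the cost per parameter
        theta = theta - learning_rate * gradient     # step against the gradient
    return theta

# Toy data (made up): a column of ones as intercept plus one feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = fit(X, y)
print("theta:", theta)
print("cost before:", cost(np.zeros(2), X, y), "cost after:", cost(theta, X, y))
print("predicted classes:", (sigmoid(X @ theta) >= 0.5).astype(int))
```

A fixed number of iterations is used here for simplicity; stopping once the cost is low enough, as described in the steps above, is an equally valid choice.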

<a name='svm'></a>
### Support Vector Machine (SVM)
A Support Vector Machine is a supervised algorithm that can classify cases by finding a separator. SVM works by first mapping data to a high dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. Then, a separator is estimated for the data. The data should be transformed in such a way that a separator could be drawn as a hyperplane. The SVM algorithm outputs an optimal hyperplane that categorizes new examples.

Challenges:
* First, how do we transform data in such a way that a separator could be drawn as a hyperplane? 
* Second, how can we find the best or optimized hyperplane separator after transformation?

The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as linear, polynomial, Radial Basis Function (RBF), and sigmoid.

One reasonable choice as the best hyperplane is the one that represents the largest separation or margin between the two classes. So the goal is to choose a hyperplane with as big a margin as possible. Examples closest to the hyperplane are **support vectors**. It is intuitive that only support vectors matter for achieving our goal, and other training examples can be ignored. We try to find the hyperplane in such a way that it has the maximum distance to support vectors.

The output of the algorithm is the values w and b for the line. You can make classifications using this estimated line. It is enough to plug in input values into the line equation. Then, you can calculate whether an unknown point is above or below the line. If the equation returns a value greater than 0, then the point belongs to the first class which is above the line, and vice-versa.  
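Classifying with the estimated line can be sketched as below. The values of `w` and `b` are hypothetical and not the result of any training; the labels +1 and -1 are an assumed convention for the two sides of the line.

```python
import numpy as np

def classify(points, w, b):
    """Plug each point into w.x + b and use the sign to pick a side of the line."""
    scores = points @ w + b
    labels = np.where(scores > 0, 1, -1)  # +1: first class, -1: second class
    return labels, scores

w = np.array([1.0, -1.0])   # hypothetical weights of the separating line
b = 0.5                     # hypothetical intercept

points = np.array([[2.0, 0.0],
                   [0.0, 2.0]])

labels, scores = classify(points, w, b)
print("scores:", scores)
print("labels:", labels)
```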

Disadvantages:  
* Support Vector Machines are prone to over-fitting if the number of features is much greater than the number of samples.
* SVMs do not directly provide probability estimates, which are desirable in most classification problems. 
* SVMs are not very efficient computationally if your dataset is very big, such as when you have more than 1,000 rows.

Where to use it:
1. image analysis tasks, such as image classification and hand written digit recognition. 
2. very effective in text mining tasks, particularly due to its effectiveness in dealing with high-dimensional data. 
3. gene expression data classification, again, because of its power in high-dimensional data classification. 
4. for other types of machine learning problems, such as regression, outlier detection and clustering.

<a name='clust'></a>
## Clustering
Segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy, as it allows a business to target specific groups more effectively.

A general segmentation process is not usually feasible for large volumes of varied data, therefore you need an analytical approach to deriving segments and groups from large datasets. Clustering groups data in an unsupervised way, based on the similarity of the data points to each other. 

Clustering means finding clusters in a dataset, unsupervised. A cluster is a group of data points or objects in a dataset that are similar to other objects in the group, and dissimilar to data points in other clusters. Classification algorithms, in contrast, predict categorical class labels. This means assigning instances to predefined classes such as defaulted or not defaulted.
Generally speaking, classification is a supervised learning where each training data instance belongs to a particular class. In clustering however, the data is unlabeled and the process is unsupervised.

Generally clustering can be used for one of the following purposes: 
* exploratory data analysis, 
* summary generation or reducing the scale, 
* outlier detection- especially to be used for fraud detection or noise removal,
* finding duplicates in datasets, or
* as a pre-processing step for either prediction, other data mining tasks or as part of a complex system.

Different clustering algorithms and their characteristics. 
* **Partition-based clustering** is a group of clustering algorithms that produces sphere-like clusters, such as: 
    * K-Means, 
    * K-Medians or 
    * Fuzzy c-Means  
    These algorithms are relatively efficient and are used for medium and large sized databases. 
* **Hierarchical clustering** algorithms produce trees of clusters, such as agglomerative and divisive algorithms. This group of algorithms are very intuitive and are generally good for use with small size datasets. 
* **Density-based clustering** algorithms produce arbitrary shaped clusters. They are especially good when dealing with spatial clusters or when there is noise in your data set.

<a name='kclust'></a>
### k-Means Clustering
K-Means groups data in an unsupervised way, based on the similarity of the data points. K-Means is a type of partitioning clustering, that is, it divides the data into K non-overlapping subsets or clusters without any cluster internal structure or labels. This means it's an unsupervised algorithm. Objects within a cluster are very similar, and objects across different clusters are very different or dissimilar. The objective of K-Means is to form clusters in such a way that similar samples go into a cluster, and dissimilar samples fall into different clusters. It can be shown that instead of a similarity metric, we can use dissimilarity metrics. In other words, conventionally the distance of samples from each other is used to shape the clusters.  
K-Means tries to minimize the intra-cluster distances and maximize the inter-cluster distances.

1. Determine the number of clusters. The key concept of the K-Means algorithm is that it randomly picks a center point for each cluster, which means we must first initialize K, the number of clusters. These center points are called the centroids of the clusters and should have the same number of features as our customer feature set. 
2. Assign each customer to the closest center. For this purpose, we have to calculate the distance of each data point, or in our case each customer, from the centroid points. Form a matrix where each row represents the distance of a customer from each centroid. It is called the Distance Matrix. The main objective of K-Means clustering is to minimize the distance of data points from the centroid of their own cluster and maximize the distance from other cluster centroids.
3. Shape clusters in such a way that the total distance of all members of a cluster from its centroid is minimized, in order to minimize error. Here, error is the total distance of each point from its centroid. It can be expressed as the within-cluster sum of squares error. 
4. Update each cluster center to be the mean of the data points in its cluster. Indeed, each centroid moves according to its cluster members. In other words, the centroid of each cluster becomes the new mean.
5. Once again, calculate the distance of all points from the new centroids. The points are reclustered and the centroids move again. This continues until the centroids no longer move. 

K-Means is an iterative algorithm and we have to repeat steps two to five until the algorithm converges. In each iteration, it will move the centroids, calculate the distances from new centroids and assign data points to the nearest centroid. It results in the clusters with minimum error or the most dense clusters. However, the algorithm is only guaranteed to converge to a local optimum, and the result may depend on the initial centroids. It is therefore common to run the whole process multiple times with different randomized starting centroids, which may give a better outcome. As the algorithm is usually very fast, it wouldn't be any problem to run it multiple times.
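The iteration described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function name `k_means`, the choice of K random data points as initial centroids, the Euclidean distance, and the toy data are all made up for this example.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k distinct data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance matrix: one row per point, one column per centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)            # assign to the closest centroid
        # Move each centroid to the mean of its members (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):    # centroids no longer move
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares error
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Toy data (made up): two well separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

labels, centroids, sse = k_means(X, k=2)
print("labels:", labels)
print("centroids:", centroids)
print("within-cluster SSE:", sse)
```

Calling `k_means` with several different `seed` values and keeping the run with the lowest error is one simple way to apply the multiple-restart idea.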

The error of a clustering can be measured based on the objective of k-Means: the average distance between data points within a cluster. Also, the average of the distances of data points from their cluster centroids can be used as a metric of error for the clustering algorithm. The correct choice of K is often ambiguous because it's very dependent on the shape and scale of the distribution of points in a dataset. Increasing K will always decrease the error, so the error alone cannot be used to pick K. Instead, the value of the metric as a function of K is plotted and the elbow point is determined, where the rate of decrease sharply shifts.

<a name='hclust'></a>
### Hierarchical Clustering
Hierarchical clustering algorithms build a hierarchy of clusters where each node is a cluster consisting of the clusters of its daughter nodes. Strategies for hierarchical clustering generally fall into two types, **divisive and agglomerative**. 
1. Divisive is top down, so you start with all observations in a large cluster and break it down into smaller pieces. Think about divisive as dividing the cluster.
2. Agglomerative is the opposite of divisive. So it is bottom up, where each observation starts in its own cluster and pairs of clusters are merged together as they move up the hierarchy. Agglomeration means to amass or collect things, which is exactly what this does with the cluster. 

Hierarchical clustering is typically visualized as a dendrogram. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged. By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. Essentially, hierarchical clustering does not require a prespecified number of clusters.

**Agglomerative clustering**
1. First, we want to create n clusters, one for each data point. Then, each point is assigned as a cluster.
2. Compute the distance proximity matrix which will be an n by n table. 
3. Run the following steps until the specified cluster number is reached, or until there is only one cluster left. 
    3.1 Merge the two nearest clusters. Distances are computed already in the proximity matrix. 
    3.2 Update the proximity matrix with the new values. 
    3.3 Stop after reaching the specified number of clusters, or when there is only one cluster remaining, with the result stored in a dendrogram.

Calculating the distance between two clusters with one point each is straightforward. But as we merge clusters in agglomerative clustering, the question becomes: how can we calculate the distance between clusters when there are multiple data points in each cluster?
We can use different criteria to find the closest clusters and merge them. In general, it completely depends on the data type, dimensionality of data and most importantly, the domain knowledge of the data set. In fact, different approaches to defining the distance between clusters distinguish the different algorithms. 
1. Single linkage is defined as the shortest distance between a point in one cluster and a point in the other cluster.
2. Complete linkage clustering. This time, we are finding the longest distance between a point in one cluster and a point in the other cluster.
3. Average linkage clustering or the mean distance. This means we're looking at the average distance of each point from one cluster to every point in another cluster.
4. Centroid linkage. The centroid is the average of the feature sets of points in a cluster. This linkage takes into account the centroid of each cluster when determining the minimum distance.
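The four linkage criteria above can be compared on a tiny example. This is a sketch assuming Euclidean distance; the function names and the two toy clusters `A` and `B` are made up for this illustration.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distance between every point in A and every point in B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):
    return pairwise_distances(A, B).min()    # closest pair of points

def complete_linkage(A, B):
    return pairwise_distances(A, B).max()    # farthest pair of points

def average_linkage(A, B):
    return pairwise_distances(A, B).mean()   # mean over all pairs

def centroid_linkage(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between centroids

# Two toy clusters on a line (made-up data)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

print("single:  ", single_linkage(A, B))
print("complete:", complete_linkage(A, B))
print("average: ", average_linkage(A, B))
print("centroid:", centroid_linkage(A, B))
```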

<a name='dclust'></a>
### Density Based Clustering
Most of the traditional clustering techniques such as K-Means, hierarchical, and Fuzzy clustering can be used to group data in an unsupervised way. However, when applied to **tasks with arbitrary shaped clusters or clusters within clusters**, traditional techniques might not be able to achieve good results, that is elements in the same cluster might not share enough similarity or the performance may be poor.

Density-based clustering locates regions of high density that are separated from one another by regions of low density. Density in this context is defined as the number of points within a specified radius. A specific and very popular type of density-based clustering is DBSCAN. 

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This technique is one of the most common clustering algorithms which works based on the density of objects. DBSCAN works on the idea that if a particular point belongs to a cluster, it should be near lots of other points in that cluster. It works based on two parameters: radius (R) and minimum points (M).

* R determines a specified radius that if it includes enough points within it, we call it a dense area. 
* M determines the minimum number of data points we want in a neighborhood to define a cluster. 

* Core point:  
    A data point is a core point if there are at least M points within its R-neighborhood.
* Border point:  
    A data point is a border point if a) its neighborhood contains fewer than M data points and b) it is reachable from some core point.  
* Outlier point:  
    An outlier is a point that is not a core point and also is not close enough to be reachable from a core point.
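The three point types can be sketched as follows. This only labels points and does not build the clusters themselves. It assumes Euclidean distance, that a point counts as a member of its own neighborhood, and that "reachable from a core point" means lying within the radius of a core point; the function name and toy data are made up for this example.

```python
import numpy as np

def classify_points(X, radius, min_points):
    """Label every point as 'core', 'border' or 'outlier'."""
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = distances <= radius                     # each point neighbors itself
    is_core = neighbors.sum(axis=1) >= min_points       # enough points within the radius
    near_core = (neighbors & is_core[None, :]).any(axis=1)  # within radius of a core point
    return np.where(is_core, "core", np.where(near_core, "border", "outlier"))

# Toy data (made up): a dense square, one point at its edge, one far away
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [2.2, 1.0],
              [8.0, 8.0]])

kinds = classify_points(X, radius=1.5, min_points=4)
print(kinds)
```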

<a name='cbre'></a>
## Content-based Recommendation Engines
Recommender systems try to capture patterns and similar behaviors, to help predict what else you might like. One of the main advantages of using recommendation systems is that users get a broader exposure to many different products they might be interested in. This exposure encourages users towards continual usage or purchase of their product.

2 main types of recommendation systems: Content-based and collaborative filtering. 
* Content-based systems try to figure out what a user's favorite aspects of an item are, and then make recommendations on items that share those aspects
* Collaborative filtering techniques find similar groups of users, and provide recommendations based on similar tastes within that group.

In terms of implementing recommender systems, there are 2 types: Memory-based and Model-based. 
* In memory-based approaches, we use the entire user-item dataset to generate a recommendation system. It uses statistical techniques to approximate users or items
    * Pearson Correlation, 
    * Cosine Similarity 
    * Euclidean Distance, among others.
* In model-based approaches, a model of users is developed in an attempt to learn their preferences
    * regression, 
    * clustering, 
    * classification, and so on
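The three memory-based measures listed above can be sketched with NumPy. The function names and the two rating vectors are made up for this illustration; each vector holds one user's ratings for the same four items.

```python
import numpy as np

def pearson_correlation(a, b):
    """Linear correlation between two rating vectors, between -1 and 1."""
    return float(np.corrcoef(a, b)[0, 1])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(a - b))

# Hypothetical ratings of two users for the same four items
user_a = np.array([5.0, 3.0, 4.0, 1.0])
user_b = np.array([4.0, 2.0, 5.0, 1.0])

print("Pearson:  ", pearson_correlation(user_a, user_b))
print("Cosine:   ", cosine_similarity(user_a, user_b))
print("Euclidean:", euclidean_distance(user_a, user_b))
```

Note that the first two are similarities (higher is closer) while the third is a distance (lower is closer).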

### Content-based Recommender Systems
A Content-based recommendation system tries to recommend items to users based on their profile. The user's profile revolves around that user's preferences and tastes. Similarity or closeness of items is measured based on the similarity in the content of those items.

Advantages and Disadvantages of Content-Based Filtering  
**Advantages**  
* Learns user's preferences
* Highly personalized for the user

**Disadvantages**
* Doesn't take into account what others think of the item, so low quality item recommendations might happen
* Extracting data is not always intuitive
* Determining what characteristics of the item the user dislikes or likes is not always obvious

<a name='colfil'></a>
### Collaborative Filtering
Collaborative filtering is based on the fact that relationships exist between products and people's interests. Many recommendation systems use collaborative filtering to find these relationships and to give an accurate recommendation of a product that the user might like or be interested in. Collaborative filtering has basically two approaches: user-based and item-based

* User-based collaborative filtering is based on the user similarity or neighborhood. Collaborative filtering bases this similarity on things like history, preference, and choices that users make when buying, watching, or enjoying something.
* Item-based collaborative filtering is based on similarity among items.
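A user-based prediction can be sketched as a similarity-weighted average of other users' ratings. This is one simple variant under stated assumptions, not the only way to do it: cosine similarity over co-rated items is an assumed choice, and the function names and the small ratings matrix are made up for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(ratings, user, item):
    """Weighted average of other users' ratings for `item`,
    weighted by their similarity to `user` on co-rated items."""
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, item]):
            continue                                  # skip self and users without a rating
        shared = ~np.isnan(ratings[user]) & ~np.isnan(ratings[other])
        if not shared.any():
            continue                                  # no items in common
        similarity = cosine_similarity(ratings[user, shared], ratings[other, shared])
        weighted_sum += similarity * ratings[other, item]
        weight_total += similarity
    return weighted_sum / weight_total if weight_total > 0 else np.nan

# Rows are users, columns are items, NaN marks a missing rating (made-up data)
ratings = np.array([
    [5.0, 4.0, np.nan],   # target user has not rated item 2
    [5.0, 4.0, 5.0],      # similar taste, liked item 2
    [1.0, 2.0, 1.0],      # different taste, disliked item 2
])

prediction = predict_rating(ratings, user=0, item=2)
print("predicted rating:", prediction)
```

The prediction is pulled towards the rating of the more similar user. An item-based variant would compare columns instead of rows.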

Challenges:
1. **Data sparsity** happens when you have a large data set of users who generally rate only a limited number of items. As mentioned, collaborative based recommenders can only predict scoring of an item if there are other users who have rated it. Due to sparsity, we might not have enough ratings in the user item dataset which makes it impossible to provide proper recommendations. 
2. **Cold start** refers to the difficulty the recommendation system has when there is a new user, and as such a profile doesn't exist for them yet. Cold start can also happen when we have a new item which has not received a rating. 
3. **Scalability** can become an issue as well. As the number of users or items increases and the amount of data expands, collaborative filtering algorithms will begin to suffer drops in performance, simply due to growth and the similarity computation.  

**Advantages**
* Takes other users' ratings into consideration
* Doesn't need to study or extract information from the recommended item
* Adapts to the user's interests which might change over time

**Disadvantages**
* Approximation function can be slow
* There might be a low number of users to approximate
* Privacy issues when trying to learn the user's preferences

[Home](#home)

<a name='lab'></a>
## Lab Exercise
<a name='lab_lin'></a>
### Linear Regression
1. Import the libraries:  
    import matplotlib.pyplot as plt  
    import pandas as pd  
    import pylab as pl  
    import numpy as np  
    %matplotlib inline
2. Get the data  
    !wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv  
    df = pd.read_csv("FuelConsumption.csv")
3. Data exploration  
    df.head()  
    df.describe()
    
    cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]  
    viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]  
    viz.hist()  
    plt.show()  
    
    plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS,  color='blue')  
    plt.xlabel("FUELCONSUMPTION_COMB")  
    plt.ylabel("Emission")  
    plt.show()  
    
    plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS,  color='blue')  
    plt.xlabel("Engine size")  
    plt.ylabel("Emission")  
    plt.show()
4. Creating training and test dataset (80% for training, 20% for testing)  
    msk = np.random.rand(len(df)) < 0.8  
    train = cdf[msk]  
    test = cdf[~msk]
5. Modeling  
    from sklearn import linear_model  
    regr = linear_model.LinearRegression()  
    train_x = np.asanyarray(train[['ENGINESIZE']])  
    train_y = np.asanyarray(train[['CO2EMISSIONS']])  
    regr.fit (train_x, train_y)  
    
    print ('Coefficients: ', regr.coef_)  
    print ('Intercept: ',regr.intercept_)
6. Plot the regression line  
    plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')  
    plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')  
    plt.xlabel("Engine size")  
    plt.ylabel("Emission")
7. Evaluation  
    from sklearn.metrics import r2_score  
    
    test_x = np.asanyarray(test[['ENGINESIZE']])  
    test_y = np.asanyarray(test[['CO2EMISSIONS']])  
    test_y_hat = regr.predict(test_x)  
    
    print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_hat - test_y)))  
    print("Mean squared error (MSE): %.2f" % np.mean((test_y_hat - test_y) ** 2))  
    print("R2-score: %.2f" % r2_score(test_y, test_y_hat) )

<a name='lab_mult'></a>
### Multiple Linear Regression
1. Import the libraries:  
    import matplotlib.pyplot as plt  
    import pandas as pd  
    import pylab as pl  
    import numpy as np  
    %matplotlib inline
2. Get the data  
    !wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv  
    df = pd.read_csv("FuelConsumption.csv")
3. Data exploration  
    df.head()  
    df.describe()
    
    cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
    
    plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS,  color='blue')  
    plt.xlabel("Engine size")  
    plt.ylabel("Emission")  
    plt.show()
4. Creating training and test dataset (80% for training, 20% for testing)  
    msk = np.random.rand(len(df)) < 0.8  
    train = cdf[msk]  
    test = cdf[~msk]
5. Plot the "training" dataset distribution  
    plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')  
    plt.xlabel("Engine size")  
    plt.ylabel("Emission")  
    plt.show()
6. Modeling  
    from sklearn import linear_model  
    regr = linear_model.LinearRegression()  
    x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])  
    y = np.asanyarray(train[['CO2EMISSIONS']])  
    regr.fit (x, y)  
    print ('Coefficients: ', regr.coef_)
7. Prediction  
    x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])  
    y = np.asanyarray(test[['CO2EMISSIONS']])  
    y_hat = regr.predict(x)  
    print("Mean squared error (MSE): %.2f"  
      % np.mean((y_hat - y) ** 2))   

<a name='lab_pol'></a>
### Polynomial Regression
1. Repeat step 1 - 5 from above

The **PolynomialFeatures()** function in the Scikit-learn library derives a new feature set from the original feature set. That is, a matrix will be generated consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, let's say the original feature set has only one feature, ENGINESIZE. Now, if we select the degree of the polynomial to be 2, then it generates 3 features: degree=0, degree=1 and degree=2:

2. Polynomial regression  
    from sklearn.preprocessing import PolynomialFeatures  
    from sklearn import linear_model  
    train_x = np.asanyarray(train[['ENGINESIZE']])  
    train_y = np.asanyarray(train[['CO2EMISSIONS']])  
    
    test_x = np.asanyarray(test[['ENGINESIZE']])  
    test_y = np.asanyarray(test[['CO2EMISSIONS']])  
    
    poly = PolynomialFeatures(degree=2)  
    train_x_poly = poly.fit_transform(train_x)  
    train_x_poly
    
**fit_transform** takes our x values and outputs our data raised to the powers 0 through 2 (since we set the degree of our polynomial to 2).

$
\begin{bmatrix}
    v_1\\
    v_2\\
    \vdots\\
    v_n
\end{bmatrix}
$
$\longrightarrow$
$
\begin{bmatrix}
    [ 1 & v_1 & v_1^2]\\
    [ 1 & v_2 & v_2^2]\\
    \vdots & \vdots & \vdots\\
    [ 1 & v_n & v_n^2]
\end{bmatrix}
$

Just consider replacing the $x$ with $x_1$, $x_1^2$ with $x_2$, and so on. Then the degree 2 equation would turn into:

$y = b + \theta_1  x_1 + \theta_2 x_2$

3. Modeling with linear regression  
    clf = linear_model.LinearRegression()  
    train_y_ = clf.fit(train_x_poly, train_y)  
    
    print ('Coefficients: ', clf.coef_)  
    print ('Intercept: ',clf.intercept_)

4. Prediction  
    plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')  
    XX = np.arange(0.0, 10.0, 0.1)  
    yy = clf.intercept_[0]+ clf.coef_[0][1]*XX+ clf.coef_[0][2]*np.power(XX, 2)  
    plt.plot(XX, yy, '-r' )  
    plt.xlabel("Engine size")  
    plt.ylabel("Emission")
5. Evaluation  
    from sklearn.metrics import r2_score  
    
    test_x_poly = poly.transform(test_x)  
    test_y_ = clf.predict(test_x_poly)  
    
    print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))  
    print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))  
    print("R2-score: %.2f" % r2_score(test_y, test_y_) )

<a name='lab_nonlin'></a>
### Non-Linear Regression
1. Import the libraries  
    import numpy as np  
    import pandas as pd  
    import matplotlib.pyplot as plt  
    %matplotlib inline
2. Examples of non-linear regression
    2.1 Cubic functions:  
        x = np.arange(-5.0, 5.0, 0.1)  
        y = 1*(x**3) + 1*(x**2) + 1*x + 3  
        y_noise = 20 * np.random.normal(size=x.size)  
        ydata = y + y_noise  
        plt.plot(x, ydata,  'bo')  
        plt.plot(x,y, 'r')  
        plt.ylabel('Dependent Variable')  
        plt.xlabel('Independent Variable')  
        plt.show()
    2.2 Quadratic functions:  
        x = np.arange(-5.0, 5.0, 0.1)  
        y = np.power(x,2)  
        y_noise = 2 * np.random.normal(size=x.size)  
        ydata = y + y_noise  
        plt.plot(x, ydata,  'bo')  
        plt.plot(x,y, 'r')  
        plt.ylabel('Dependent Variable')  
        plt.xlabel('Independent Variable')  
        plt.show()
    2.3 Exponential functions  
        X = np.arange(-5.0, 5.0, 0.1)  
        Y= np.exp(X)  
        plt.plot(X,Y)  
        plt.ylabel('Dependent Variable')  
        plt.xlabel('Independent Variable')  
        plt.show()
    2.4 Logarithmic functions:  
        X = np.arange(0.1, 5.0, 0.1)  # the logarithm is only defined for positive x  
        Y = np.log(X)  
        plt.plot(X,Y)  
        plt.ylabel('Dependent Variable')  
        plt.xlabel('Independent Variable')  
        plt.show()
    2.5 Logistic functions  
        X = np.arange(-5.0, 5.0, 0.1)  
        Y = 1-4/(1+np.power(3, X-2))  
        plt.plot(X,Y)  
        plt.ylabel('Dependent Variable')  
        plt.xlabel('Independent Variable')  
        plt.show()
3. Import the data set  
    !wget -nv -O china_gdp.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv  
    df = pd.read_csv("china_gdp.csv")  
4. Data exploration  
    df.head(10)  
5. Plot the dataset  
    plt.figure(figsize=(8,5))  
    x_data, y_data = (df["Year"].values, df["Value"].values)  
    plt.plot(x_data, y_data, 'ro')  
    plt.ylabel('GDP')  
    plt.xlabel('Year')  
    plt.show() 
6. Choosing a model -> From an initial look at the plot, we determine that the logistic function could be a good approximation  
    X = np.arange(-5.0, 5.0, 0.1)  
    Y = 1.0 / (1.0 + np.exp(-X))  
    plt.plot(X,Y)  
    plt.ylabel('Dependent Variable')  
    plt.xlabel('Independent Variable')  
    plt.show()
7. Building a model ->  build our regression model and initialize its parameters  
    def sigmoid(x, Beta_1, Beta_2):  
        y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))  
        return y
    
    beta_1 = 0.10  
    beta_2 = 1990.0  
    
    Y_pred = sigmoid(x_data, beta_1 , beta_2)  
    
    plt.plot(x_data, Y_pred*15000000000000.)  
    plt.plot(x_data, y_data, 'ro')  
    
    Normalise the data:  
    xdata =x_data/max(x_data)  
    ydata =y_data/max(y_data)  
    
    Use `curve_fit`, which applies non-linear least squares, to fit the sigmoid function to the data:  
    from scipy.optimize import curve_fit  
    popt, pcov = curve_fit(sigmoid, xdata, ydata)  
    
    print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
8. Plot the optimised model  
    x = np.linspace(1960, 2015, 55)  
    x = x/max(x)  
    plt.figure(figsize=(8,5))  
    y = sigmoid(x, *popt)  
    plt.plot(xdata, ydata, 'ro', label='data')  
    plt.plot(x,y, linewidth=3.0, label='fit')  
    plt.legend(loc='best')  
    plt.ylabel('GDP')  
    plt.xlabel('Year')  
    plt.show()
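
To illustrate what "non-linear least squares" means, the sketch below scores a handful of candidate parameter pairs by their sum of squared residuals and keeps the best one. It is only an illustration under simplifying assumptions: the data are synthetic and noise-free, the candidate grid is arbitrary, and `curve_fit` itself uses an iterative optimiser rather than a grid search.

```python
import math


def sigmoid(x, beta_1, beta_2):
    # Same functional form as the sigmoid used for the GDP fit
    return 1.0 / (1.0 + math.exp(-beta_1 * (x - beta_2)))


def sum_squared_residuals(params, xs, ys):
    # Least-squares objective: total squared gap between data and model
    beta_1, beta_2 = params
    return sum((y - sigmoid(x, beta_1, beta_2)) ** 2 for x, y in zip(xs, ys))


# Synthetic data generated from known parameters (illustrative only)
true_params = (5.0, 0.5)
xs = [i / 10 for i in range(11)]
ys = [sigmoid(x, *true_params) for x in xs]

# Arbitrary candidate grid; the pair with the smallest objective wins
candidates = [(b1, b2) for b1 in (1.0, 3.0, 5.0, 7.0) for b2 in (0.3, 0.5, 0.7)]
best = min(candidates, key=lambda p: sum_squared_residuals(p, xs, ys))
```

Because the data were generated without noise, the true parameters give a residual of exactly zero and are selected as `best`.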

<a name='lab_knear'></a>
### K-Nearest Neighbours
1. Import the libraries  
    import itertools  
    import numpy as np  
    import matplotlib.pyplot as plt  
    from matplotlib.ticker import NullFormatter  
    import pandas as pd  
    import matplotlib.ticker as ticker  
    from sklearn import preprocessing  
    %matplotlib inline  
2. Get the data  
    !wget -O teleCust1000t.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/teleCust1000t.csv  
3. Explore the dataset  
    df = pd.read_csv('teleCust1000t.csv')  
    df.head()  
    df['custcat'].value_counts()  
    df.hist(column='income', bins=50)  
    df.columns
4. Define feature sets (split between independent and dependent variable) 
    X = df[['region', 'tenure','age', 'marital', 'address', 'income', 'ed', 'employ','retire', 'gender', 'reside']] .values  #.astype(float)  
    X[0:5]  
    y = df['custcat'].values  
    y[0:5]
5. Normalise the dataset  
    X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))  
    X[0:5]
6. Split between Train and Test  
    from sklearn.model_selection import train_test_split  
    X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)  
    print ('Train set:', X_train.shape,  y_train.shape)  
    print ('Test set:', X_test.shape,  y_test.shape)
7. Classification: K-Nearest Neighbours  
    from sklearn.neighbors import KNeighborsClassifier
8. Training  
    k = 4  
    neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)  
    neigh  
9. Predicting ( $\hat{y}$ )  
    yhat = neigh.predict(X_test)  
    yhat[0:5]
10. Accuracy evaluation  
    from sklearn import metrics  
    print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))  
    print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
    
    Repeat the training for different values of k to find the best one:  
    Ks = 10  
    mean_acc = np.zeros((Ks-1))
    std_acc = np.zeros((Ks-1))  
    for n in range(1,Ks):
        neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)  
        yhat=neigh.predict(X_test)
        mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)  
        std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])  
    mean_acc  
    
    plt.plot(range(1,Ks),mean_acc,'g')  
    plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)  
    plt.legend(('Accuracy ', '+/- 1xstd'))  
    plt.ylabel('Accuracy ')  
    plt.xlabel('Number of Neighbours (K)')  
    plt.tight_layout()  
    plt.show()  
    
    print("The best accuracy was", mean_acc.max(), "with k =", mean_acc.argmax()+1) 
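
What `KNeighborsClassifier` does at prediction time can be sketched in a few lines of plain Python: measure the distance from the query point to every training point, keep the k closest, and take a majority vote over their labels. The points and labels below are toy values, not the telecom customer data, and ties are resolved by whichever label is seen first, which may differ from scikit-learn.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    # Sort training points by Euclidean distance to the query
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the k nearest labels
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated groups with labels 1 and 2
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [1, 1, 1, 2, 2, 2]

pred_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
pred_b = knn_predict(train_points, train_labels, (5.0, 5.0), k=3)
```

Because the vote depends on raw distances, features on larger scales would dominate, which is why the dataset is standardised before training.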

<a name='lab_tree'></a>
### Decision Trees
1. Import libraries  
    import numpy as np  
    import pandas as pd  
    from sklearn.tree import DecisionTreeClassifier  
2. Get the data  
    !wget -O drug200.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv  
3. Explore the data  
    my_data = pd.read_csv("drug200.csv", delimiter=",")  
    my_data[0:5]
4. Summarise the data  
    my_data.describe()  
    my_data.count()  
    my_data.shape  # shape is an attribute, not a method
5. Define the feature Matrix (X) and the response vector (y)  
    X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values  
    X[0:5]
6. Encode the categorical variables as integers with `LabelEncoder`, because scikit-learn decision trees need numeric features  
    from sklearn import preprocessing  
    le_sex = preprocessing.LabelEncoder()  
    le_sex.fit(['F','M'])  
    X[:,1] = le_sex.transform(X[:,1])  
    
    le_BP = preprocessing.LabelEncoder()  
    le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])  
    X[:,2] = le_BP.transform(X[:,2])  
    
    le_Chol = preprocessing.LabelEncoder()  
    le_Chol.fit([ 'NORMAL', 'HIGH'])  
    X[:,3] = le_Chol.transform(X[:,3]) 
    X[0:5]
7. Define the response vector  
    y = my_data["Drug"]  
    y[0:5]
8. Setting up the decision tree  
    from sklearn.model_selection import train_test_split  
    X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
9. Check that the dimensions fit  
    X_trainset.shape  
    y_trainset.shape  
    
    and  
    
    X_testset.shape  
    y_testset.shape
10. Modelling
    drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)  
    drugTree # it shows the default parameters
11. Fit the trainset  
    drugTree.fit(X_trainset,y_trainset)
12. Prediction  
    predTree = drugTree.predict(X_testset)  
    print (predTree [0:5])  
    print (y_testset [0:5])  
13. Evaluation  
    from sklearn import metrics  
    import matplotlib.pyplot as plt  
    print("Decision tree's accuracy: ", metrics.accuracy_score(y_testset, predTree))
14. Visualise: 
    Install libraries  
    !conda install -c conda-forge pydotplus -y  
    !conda install -c conda-forge python-graphviz -y  
    
    from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn  
    import pydotplus  
    import matplotlib.image as mpimg  
    from sklearn import tree  
    %matplotlib inline  
    
    dot_data = StringIO()  
    filename = "drugtree.png"  
    featureNames = my_data.columns[0:5].tolist()  
    targetNames = my_data["Drug"].unique().tolist()  
    out=tree.export_graphviz(drugTree, feature_names=featureNames, out_file=dot_data, class_names=np.unique(y_trainset).tolist(), filled=True, special_characters=True, rotate=False)  
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
    graph.write_png(filename)  
    img = mpimg.imread(filename)  
    plt.figure(figsize=(100, 200))  
    plt.imshow(img,interpolation='nearest')
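
With `criterion="entropy"` the tree chooses splits by information gain, that is, by how much a split reduces the entropy of the class labels. The sketch below computes both quantities in plain Python for two hypothetical splits; the label lists are toy values rather than rows of `drug200.csv`.

```python
import math
from collections import Counter


def entropy(labels):
    # Shannon entropy (in bits) of the class distribution
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, left, right):
    # Entropy of the parent minus the size-weighted entropy of the children
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy node with two equally frequent classes
parent = ['drugA'] * 4 + ['drugB'] * 4

# A split that separates the classes perfectly
gain_pure = information_gain(parent, parent[:4], parent[4:])

# A split that leaves both children as mixed as the parent
mixed = ['drugA', 'drugA', 'drugB', 'drugB']
gain_mixed = information_gain(parent, mixed, mixed)
```

The perfect split recovers the full one bit of entropy, while the uninformative split gains nothing, so the tree would prefer the first.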

<a name='lab_log'></a>
### Logistic Regression
Logistic Regression passes the input through the logistic (sigmoid) function and treats the result as a probability:

<img
src="https://ibm.box.com/shared/static/kgv9alcghmjcv97op4d6onkyxevk23b1.png" width="400" align="center">


The objective of the __Logistic Regression__ algorithm is to find the parameters $\theta$ of $h_\theta(x) = \sigma(\theta^T X)$ such that the model best predicts the class of each case.

$$
h_\theta(x) = \sigma(\theta^T X) = \frac{e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}{1 + e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots)}}
$$
Or:
$$
\text{Probability of class 1} = P(Y=1 \mid X) = \sigma(\theta^T X) = \frac{e^{\theta^T X}}{1 + e^{\theta^T X}}
$$  
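
The formula can be checked numerically with a short sketch. It evaluates the sigmoid in the form written above and in the equivalent form $1/(1+e^{-z})$, then turns the probability into a class label with a 0.5 threshold. The parameter vector `theta` and the feature values `x` are made-up numbers for illustration.

```python
import math


def sigmoid_exp_form(z):
    # e^z / (1 + e^z), the form used in the equation above
    return math.exp(z) / (1.0 + math.exp(z))


def sigmoid_inverse_form(z):
    # 1 / (1 + e^-z), algebraically the same function
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    # theta[0] is the intercept; theta[1:] pair up with the features x_1, x_2, ...
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid_inverse_form(z)


theta = [-1.0, 2.0, 0.5]   # illustrative parameters
x = [0.5, 1.0]             # illustrative feature values

p = predict_proba(theta, x)            # estimated P(Y=1 | X)
predicted_class = 1 if p >= 0.5 else 0
```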

1. Import the libraries:  
    import pandas as pd  
    import pylab as pl  
    import numpy as np  
    import scipy.optimize as opt  
    from sklearn import preprocessing  
    %matplotlib inline  
    import matplotlib.pyplot as plt 
2. Get the dataset:  
    !wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv  
    churn_df = pd.read_csv("ChurnData.csv")
3. Explore the dataset  
    churn_df.head()  
    churn_df.shape  
    churn_df.describe()  
    churn_df.count()
4. Pre-processing
    4.1 Define X, y and standardise:  
        X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])  
        X[0:5]  
        y = np.asarray(churn_df['churn'])  
        y [0:5]  
        X = preprocessing.StandardScaler().fit(X).transform(X)
5. Build the train and test dataset:  
    from sklearn.model_selection import train_test_split  
    X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)  
    print ('Train set:', X_train.shape,  y_train.shape)  
    print ('Test set:', X_test.shape,  y_test.shape)
6. Modeling the logistic regression  
    `LogisticRegression` from scikit-learn can use several numerical solvers to find the parameters, including `newton-cg`, `lbfgs`, `liblinear`, `sag` and `saga`, and it supports regularisation, a technique that counters overfitting. The `C` parameter is the inverse of the regularisation strength and must be a positive float; smaller values mean stronger regularisation.  
    
    from sklearn.linear_model import LogisticRegression  
    from sklearn.metrics import confusion_matrix  
    LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)  
    LR  
    
    yhat = LR.predict(X_test)  
    yhat  
    
    yhat_prob = LR.predict_proba(X_test)  
    yhat_prob
    
7. Evaluation:  
    7.1 Jaccard index: the size of the intersection divided by the size of the union of the predicted and the true label sets. A score of 1.0 means the predictions match the true labels exactly; 0.0 means there is no overlap.  
    
    from sklearn.metrics import jaccard_score  # replaces jaccard_similarity_score in recent scikit-learn  
    jaccard_score(y_test, yhat)  # Jaccard index of the positive class (label 1)
    
    7.2 Confusion matrix
    see module ML0101EN in Jupyter
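
Both evaluation measures can be worked out by hand on a small example. The sketch below counts the four confusion-matrix cells for a binary problem and derives the Jaccard index of the positive class from them. The label vectors are invented for illustration and are not the churn predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true positives, false positives, false negatives and true negatives
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def jaccard_index(y_true, y_pred, positive=1):
    # Intersection over union for the positive class: TP / (TP + FP + FN)
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp + fn)


# Illustrative labels only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
jaccard = jaccard_index(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
```

Note that the Jaccard index ignores true negatives, so it can differ noticeably from plain accuracy when the negative class dominates.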

<a name='lab_svm'></a>
### Support Vector Machines
1. Import the libraries:  
    import pandas as pd  
    import pylab as pl  
    import numpy as np  
    import scipy.optimize as opt  
    from sklearn import preprocessing  
    from sklearn.model_selection import train_test_split  
    %matplotlib inline 
    import matplotlib.pyplot as plt
2. Get the dataset:  
    !wget -O cell_samples.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv  
    cell_df = pd.read_csv("cell_samples.csv")
3. Explore the dataset:  
    ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');  
    cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);  
    plt.show()  
    
    cell_df.dtypes
4. Clean the dataset:  
    cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()].copy()  # copy so the assignment below does not act on a view  
    cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')  
    cell_df.dtypes  
    
    feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]  
    X = np.asarray(feature_df)  
    X[0:5]  
    
    cell_df['Class'] = cell_df['Class'].astype('int')  
    y = np.asarray(cell_df['Class'])  
    y [0:5]
5. Build the train and test sets  
    X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)  
    print ('Train set:', X_train.shape,  y_train.shape)  
    print ('Test set:', X_test.shape,  y_test.shape)  
6. Modelling:  
    from sklearn import svm  
    clf = svm.SVC(kernel='rbf')  # rbf = radial basis function  
    clf.fit(X_train, y_train) 
    
    yhat = clf.predict(X_test)  
    yhat [0:5]
7. Evaluation:  
    see Module ML0101
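
The radial basis function kernel selected with `kernel='rbf'` measures similarity as $K(a, b) = e^{-\gamma \lVert a - b \rVert^2}$. A minimal sketch in plain Python follows; the value of `gamma` and the sample vectors are arbitrary choices for illustration, not the values scikit-learn would pick for the cell data.

```python
import math


def rbf_kernel(a, b, gamma):
    # exp(-gamma * squared Euclidean distance between a and b)
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


gamma = 0.1  # arbitrary illustrative value

sample = (5.0, 1.0, 1.0)
near = (5.0, 2.0, 1.0)
far = (10.0, 8.0, 9.0)

k_self = rbf_kernel(sample, sample, gamma)  # identical points give the maximum, 1.0
k_near = rbf_kernel(sample, near, gamma)
k_far = rbf_kernel(sample, far, gamma)
```

The kernel value decays towards zero as points move apart, and a larger `gamma` makes that decay faster, so each training point influences a smaller neighbourhood.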

<a name='lab_kclu'></a>
### Customer segmentation with k-Means Clustering
1. Import the libraries:  
    import random  
    import numpy as np  
    import matplotlib.pyplot as plt  
    from sklearn.cluster import KMeans  
    from sklearn.datasets import make_blobs  
    import pandas as pd  # needed for pd.read_csv below  
    %matplotlib inline
2. Get the dataset:  
    !wget -O Cust_Segmentation.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/Cust_Segmentation.csv  
    cust_df = pd.read_csv("Cust_Segmentation.csv")
3. Explore the data and prepare  
    df = cust_df.drop('Address', axis=1)  
    df.head()
4. Normalising over the standard deviation  
    from sklearn.preprocessing import StandardScaler  
    X = df.values[:,1:]  
    X = np.nan_to_num(X)  
    Clus_dataSet = StandardScaler().fit_transform(X)  
    Clus_dataSet
5. Modeling:  
    clusterNum = 3  
    k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)  
    k_means.fit(X)  
    labels = k_means.labels_
    print(labels)
6. Assign the labels to each row  
    df["Clus_km"] = labels  
    df.head(5)
7. Insights:  
    df.groupby('Clus_km').mean()  
    
    area = np.pi * ( X[:, 1])**2  
    plt.scatter(X[:, 0], X[:, 3], s=area, c=labels.astype(float), alpha=0.5)  # np.float was removed from NumPy; use float  
    plt.xlabel('Age', fontsize=18)  
    plt.ylabel('Income', fontsize=16)  
    plt.show()  
    
    from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib  
    fig = plt.figure(1, figsize=(8, 6))  
    plt.clf()  
    ax = fig.add_subplot(111, projection='3d')  
    ax.view_init(elev=48, azim=134)  
    ax.set_xlabel('Education')  
    ax.set_ylabel('Age')  
    ax.set_zlabel('Income')  
    ax.scatter(X[:, 1], X[:, 0], X[:, 3], c=labels.astype(float))
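
Under the hood, k-means alternates between two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. The following plain-Python sketch runs a few such iterations on toy 2-D points. The starting centroids are fixed by hand here, whereas `KMeans` chooses them with `k-means++` and repeats the run `n_init` times.

```python
import math


def assign_clusters(points, centroids):
    # Label each point with the index of its nearest centroid
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def update_centroids(points, labels, k):
    # Move each centroid to the mean of its members
    # (this sketch assumes every cluster keeps at least one member)
    centroids = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids


# Toy data and hand-picked starting centroids (illustrative only)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.5, 7.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

for _ in range(5):
    labels = assign_clusters(points, centroids)
    centroids = update_centroids(points, labels, k=2)
```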

<a name='lab_hclust'></a>
### Hierarchical Clustering
1. Import the libraries:  
    import numpy as np  
    import pandas as pd  
    from scipy import ndimage  
    from scipy.cluster import hierarchy  
    from scipy.spatial import distance_matrix  
    from matplotlib import pyplot as plt  
    from sklearn import manifold, datasets  
    from sklearn.cluster import AgglomerativeClustering  
    from sklearn.datasets import make_blobs  
    %matplotlib inline
2. Get the data:  
    !wget -O cars_clus.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv  
    filename = 'cars_clus.csv'  
    pdf = pd.read_csv(filename)  
3. Explore and prepare the data set:  
    print ("Size of dataset before cleaning: ", pdf.size)  
    # Columns that should be numeric; unparseable entries become NaN  
    numeric_cols = ['sales', 'resale', 'type', 'price', 'engine_s',
        'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
        'mpg', 'lnsales']  
    pdf[numeric_cols] = pdf[numeric_cols].apply(pd.to_numeric, errors='coerce')  
    pdf = pdf.dropna()  
    pdf = pdf.reset_index(drop=True)  
    print ("Size of dataset after cleaning: ", pdf.size)  
    pdf.head(5)
4. Feature definition:  
    featureset = pdf[['engine_s',  'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]
5. Normalisation:  
    from sklearn.preprocessing import MinMaxScaler  
    x = featureset.values #returns a numpy array  
    min_max_scaler = MinMaxScaler()  
    feature_mtx = min_max_scaler.fit_transform(x)  
    feature_mtx [0:5]
6. Build the distance matrix:  
    from scipy.spatial.distance import euclidean  
    leng = feature_mtx.shape[0]  
    D = np.zeros([leng, leng])  # scipy.zeros is a deprecated alias of numpy.zeros  
    for i in range(leng):  
        for j in range(leng):  
            D[i,j] = euclidean(feature_mtx[i], feature_mtx[j])
7. Clustering  
    import pylab  
    import scipy.cluster.hierarchy  
    Z = hierarchy.linkage(D, 'complete')
    
    from scipy.cluster.hierarchy import fcluster  
    k = 5  
    clusters = fcluster(Z, k, criterion='maxclust')  
    clusters
8. Plot the dendrogram:  
    fig = pylab.figure(figsize=(18,50))
    def llf(id):  
        return '[%s %s %s]' % (pdf['manufact'][id], pdf['model'][id], int(float(pdf['type'][id])) )  
    dendro = hierarchy.dendrogram(Z,  leaf_label_func=llf, leaf_rotation=0, leaf_font_size =12, orientation = 'right')
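
The `'complete'` option passed to `linkage` defines the distance between two clusters as the distance between their farthest pair of points. A small plain-Python sketch contrasts it with single linkage, which uses the closest pair. The clusters below are toy points, not rows of the cars dataset.

```python
import math


def complete_linkage(cluster_a, cluster_b):
    # Distance between the two farthest points, one from each cluster
    return max(math.dist(a, b) for a in cluster_a for b in cluster_b)


def single_linkage(cluster_a, cluster_b):
    # Distance between the two closest points, one from each cluster
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)


# Toy clusters on a line (illustrative only)
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

d_complete = complete_linkage(cluster_a, cluster_b)
d_single = single_linkage(cluster_a, cluster_b)
```

Agglomerative clustering repeatedly merges the two clusters with the smallest such distance, so complete linkage tends to favour compact clusters.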


<a name='lab_dclust'></a>
### Density-Based Clustering
1. Import the libraries:  
    import numpy as np  
    from sklearn.cluster import DBSCAN  
    import pandas as pd  # needed for pd.read_csv below  
    from sklearn.datasets import make_blobs  
    from sklearn.preprocessing import StandardScaler  
    import matplotlib.pyplot as plt  
    %matplotlib inline
2. Get the dataset:  
    !wget -O weather-stations20140101-20141231.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/weather-stations20140101-20141231.csv  
    filename='weather-stations20140101-20141231.csv'
    pdf = pd.read_csv(filename)
3. Explore the dataset and prepare:  
    pdf = pdf[pd.notnull(pdf["Tm"])]  
    pdf = pdf.reset_index(drop=True)  
    pdf.head(5)
4. Visualisation: see the file ML0101


<a name='lab_cbre'></a>
### Content-based Recommender Systems
1. Get the dataset  
    !wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip  
    print('unziping ...')  
    !unzip -o -j moviedataset.zip 
2. Importing the libraries  
    import pandas as pd  
    from math import sqrt  
    import numpy as np  
    import matplotlib.pyplot as plt  
    %matplotlib inline
3. Exploring the dataset  
    movies_df = pd.read_csv('movies.csv')  
    ratings_df = pd.read_csv('ratings.csv')  
    movies_df.head()
4. Cleaning up the dataset
    movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)  
    movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)  
    movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)  
    movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())  
    movies_df.head()  
    
    movies_df['genres'] = movies_df.genres.str.split('|')  
    movies_df.head()  
    
    Keeping genres in a list format isn't optimal for the content-based recommendation technique, so we use **One Hot Encoding** to convert the list of genres into a vector in which each column corresponds to one possible genre.  
    
    moviesWithGenres_df = movies_df.copy()  
    for index, row in movies_df.iterrows():  
        for genre in row['genres']:  
            moviesWithGenres_df.at[index, genre] = 1  
    moviesWithGenres_df = moviesWithGenres_df.fillna(0)  
    moviesWithGenres_df.head()  
    
    ratings_df = ratings_df.drop('timestamp', axis=1)  
    ratings_df.head()
5. Content based recommender system:  
    userInput = [
        {'title':'Breakfast Club, The', 'rating':5},
        {'title':'Toy Story', 'rating':3.5},
        {'title':'Jumanji', 'rating':2},
        {'title':"Pulp Fiction", 'rating':5},
        {'title':'Akira', 'rating':4.5}]  
    inputMovies = pd.DataFrame(userInput)  
    inputMovies  
    
    Look up the movie IDs of the input titles in the movies dataframe and add them to the input:  
    inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]  
    inputMovies = pd.merge(inputId, inputMovies)  
    inputMovies = inputMovies.drop('genres', axis=1).drop('year', axis=1)  
    inputMovies  
    
    Select the input movies from the one-hot-encoded genre table:  
    userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]  
    
    userMovies = userMovies.reset_index(drop=True)  
    userGenreTable = userMovies.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)  
    userGenreTable  
    
    Learn the input's preferences by weighting each genre with the user's ratings:  
    userProfile = userGenreTable.transpose().dot(inputMovies['rating'])  
    userProfile  
    
    genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])  
    genreTable = genreTable.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)  
    genreTable.head()  
    
    Take the weighted average of every movie based on the input profile and recommend the top twenty movies:  
    recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())  
    recommendationTable_df = recommendationTable_df.sort_values(ascending=False)  
    recommendationTable_df.head()
6. The final recommendation table:  
    movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]
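
The arithmetic behind the user profile and the recommendation score is easier to follow on a tiny example. The sketch below uses plain Python with three genres and hypothetical movie titles; none of the values come from the MovieLens files.

```python
genres = ['Adventure', 'Comedy', 'Drama']

# One-hot genre flags of the movies the user has rated (hypothetical titles)
rated_movies = {'Movie A': [1, 1, 0], 'Movie B': [0, 0, 1]}
ratings = {'Movie A': 5.0, 'Movie B': 2.0}

# User profile: for each genre, sum the ratings of the rated movies in that genre
user_profile = [
    sum(rated_movies[title][g] * ratings[title] for title in rated_movies)
    for g in range(len(genres))
]

# Candidate movies to score (hypothetical titles)
candidates = {'Movie C': [1, 0, 0], 'Movie D': [0, 1, 1], 'Movie E': [0, 0, 1]}

# Score: profile weights of the movie's genres divided by the total profile weight
profile_total = sum(user_profile)
scores = {
    title: sum(flag * weight for flag, weight in zip(flags, user_profile)) / profile_total
    for title, flags in candidates.items()
}

# Highest score first
ranked = sorted(scores, key=scores.get, reverse=True)
```

The user rated an adventure comedy highly and a drama poorly, so the profile is `[5.0, 5.0, 2.0]` and the comedy-drama candidate ranks above the pure drama.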

<a name='lab_colfil'></a>
### Collaborative Filtering
1. Get the dataset:  
    !wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip  
    print('unziping ...')  
    !unzip -o -j moviedataset.zip 
2. Import the libraries:  
    import pandas as pd  
    from math import sqrt  
    import numpy as np  
    import matplotlib.pyplot as plt  
    %matplotlib inline
3. Explore the dataset:  
    movies_df = pd.read_csv('movies.csv')  
    ratings_df = pd.read_csv('ratings.csv')
    movies_df.head()
4. Prepare the data set:  
    movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)  
    movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)  
    movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)  
    movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())  
    movies_df = movies_df.drop('genres', axis=1)  
    movies_df.head()  
    
    ratings_df = ratings_df.drop('timestamp', axis=1)  
    ratings_df.head()
5. Collaborative Filtering  
    **User-User Filtering**  
    * Select a user together with the movies the user has watched
    * Get the rating records of other users who have watched the same movies
    * Calculate a similarity score between the selected user and each of those users
    * Keep the top X most similar users (the neighbours)
    * Recommend the items with the highest weighted score  
    
    userInput = [
        {'title':'Breakfast Club, The', 'rating':5},
        {'title':'Toy Story', 'rating':3.5},
        {'title':'Jumanji', 'rating':2},
        {'title':"Pulp Fiction", 'rating':5},
        {'title':'Akira', 'rating':4.5} ]  
    inputMovies = pd.DataFrame(userInput)  
    inputMovies
    
    Add the movie IDs:  
    inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]  
    inputMovies = pd.merge(inputId, inputMovies)  
    inputMovies = inputMovies.drop('year', axis=1)  
    inputMovies
    
    Get the subset of users who have watched and reviewed the input movies:  
    userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]  
    userSubset.head()  
    
    Group by userId:  
    userSubsetGroup = userSubset.groupby('userId')  # a single key keeps each group name a plain user ID  
    userSubsetGroup.get_group(1130)  
    
    Similarity of users to the input user  
    Why Pearson correlation? It is invariant to scaling and shifting: multiplying all elements by a positive constant or adding any constant to all elements leaves it unchanged (a negative multiplier only flips its sign). For example, for two vectors X and Y, pearson(X, Y) == pearson(X, 2 * Y + 3). This is an important property in recommendation systems, for instance when two users share the same tastes but use the rating scale differently.  
    
    Store the Pearson correlation in a dictionary, where the key is the user ID and the value is the coefficient:  
    pearsonCorrelationDict = {}  
    for name, group in userSubsetGroup:  
        # Sort both frames by movieId so the two rating lists line up  
        group = group.sort_values(by='movieId')  
        inputMovies = inputMovies.sort_values(by='movieId')  
        nRatings = len(group)  
        # Ratings the input user gave to the movies this user has also rated  
        temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]  
        tempRatingList = temp_df['rating'].tolist()  
        tempGroupList = group['rating'].tolist()  
        Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)  
        Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)  
        Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)  
        # The coefficient is undefined when either variance is zero; store 0 instead  
        if Sxx != 0 and Syy != 0:  
            pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)  
        else:  
            pearsonCorrelationDict[name] = 0  
    
    pearsonCorrelationDict.items()  
    pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')  
    pearsonDF.columns = ['similarityIndex']  
    pearsonDF['userId'] = pearsonDF.index  
    pearsonDF.index = range(len(pearsonDF))  
    pearsonDF.head()  
    
    The top x similar users to input user  
    topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]  
    topUsers.head()  
    
    Rating of selected users to all movies  
    topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')  
    topUsersRating.head()  
    
    Multiplies the similarity by the user's ratings  
    topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']  
    topUsersRating.head()  
    
    Sum the similarity index and the weighted rating of the top users after grouping by movieId  
    tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]  
    tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']  
    tempTopUsersRating.head() 
    
    Build the recommendation  
    recommendation_df = pd.DataFrame()  
    recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']  
    recommendation_df['movieId'] = tempTopUsersRating.index  
    recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
    recommendation_df.head()  
    
    movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]
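
The invariance of the Pearson correlation to scaling and shifting can be verified with a short sketch. It uses the same sum-of-squares formulation as the loop above, written in plain Python; the two rating vectors are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    # Pearson correlation from sums of squares and cross-products
    n = len(xs)
    sxx = sum(x ** 2 for x in xs) - sum(xs) ** 2 / n
    syy = sum(y ** 2 for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    if sxx == 0 or syy == 0:
        return 0.0  # undefined when one vector is constant; return 0 as a convention
    return sxy / sqrt(sxx * syy)


# Illustrative ratings of the same five movies by two users
x = [5.0, 3.5, 2.0, 5.0, 4.5]
y = [4.0, 3.0, 2.5, 5.0, 4.0]

r = pearson(x, y)
r_rescaled = pearson(x, [2 * v + 3 for v in y])   # scale by 2, shift by 3
r_negated = pearson(x, [-v for v in y])           # a negative factor flips the sign
```

`r` and `r_rescaled` agree up to floating-point rounding, while `r_negated` has the same magnitude with the opposite sign.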



[Home](#home)