# 1. What is Machine Learning?

In 1959, Arthur Samuel, a computer scientist who pioneered the study of artificial intelligence, described machine learning as “the study that gives computers the ability to learn without being explicitly programmed.”

Machine learning is a branch of artificial intelligence in which a computer learns from past experience (input data) and uses what it has learned to make predictions about new data. For many tasks the aim is performance that matches or approaches human level, although this is a goal rather than a requirement.

![https://machinelearningmastery.com/wp-content/uploads/2015/12/Traditional-Programming-vs-Machine-Learning-300x213.png](https://machinelearningmastery.com/wp-content/uploads/2015/12/Traditional-Programming-vs-Machine-Learning-300x213.png)

# 2. Applications of Machine Learning

Sample applications of machine learning:

- Web search: ranking pages based on what you are most likely to click on.
- Computational biology: rational drug design on the computer, based on past experiments.
- Finance: deciding who should receive which credit card offers, evaluating the risk of credit offers, and deciding where to invest money.
- E-commerce: predicting customer churn and detecting whether a transaction is fraudulent.
- Space exploration: space probes and radio astronomy.
- Robotics: handling uncertainty in new environments, for example in autonomous, self-driving cars.
- Information extraction: asking questions over databases across the web.
- Social networks: extracting value from data on relationships and preferences.
- Debugging: suggesting where a bug might be, since debugging is a labor-intensive process.

# 3. Machine Learning Categories

Machine learning is generally categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning. Semi-supervised learning, also described in this section, sits between the first two.

## Supervised Learning:
In supervised learning, the machine is given examples together with a label or target for each one. The labels let the algorithm learn how the input features relate to the desired output.

Two of the most common supervised machine learning tasks are classification and regression.

- In **classification** problems the machine must learn to predict discrete values. That is, the machine must predict the most probable category, class, or label for new examples. Applications of classification include predicting whether a stock's price will rise or fall, or deciding if a news article belongs to the politics or leisure section. 
- In **regression** problems the machine must predict the value of a continuous response variable. Examples of regression problems include predicting the sales for a new product, or the salary for a job based on its description.

## Unsupervised Learning:
When the data is unclassified and unlabeled, the system attempts to uncover patterns in the data on its own. No label or target is given for the examples. One common task, called clustering, is to group similar examples together.

## Semi-supervised learning:
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision.

## Reinforcement Learning:
Reinforcement learning refers to goal-oriented algorithms that learn how to attain a complex objective (goal), or to maximize some quantity over many steps. This allows machines and software agents to determine automatically the ideal behavior within a specific context in order to maximize their performance. The agent needs only simple reward feedback, known as the reinforcement signal, to learn which action is best. For example, an agent can learn to maximize the points won in a game over many moves.
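
As a rough illustration of learning from a reward signal, the sketch below runs an epsilon-greedy agent on a two-action "bandit" problem. The setup is an assumption made for this example (the function name `run_bandit`, the reward probabilities, and the value of `epsilon` are all hypothetical); it is one of the simplest reinforcement learning settings, not a general-purpose method.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit with 0/1 rewards."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was tried
    values = [0.0] * len(reward_probs)  # running mean reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(values)), key=values.__getitem__)
        # Reinforcement signal: 1 with the action's hidden probability, else 0.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, total_reward

# Hypothetical environment: action 1 pays off far more often than action 0.
values, total_reward = run_bandit([0.2, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print(best_action, total_reward)
```

The agent is never told which action is better; it only sees rewards, and its value estimates drift toward the true payoff rates as it keeps acting.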

# 4. A Framework for Studying Supervised Learning

Terminology used in machine learning:

- **Training example:** a sample x together with its output f(x) from the target function.
- **Target function:** the mapping function f from x to f(x).
- **Hypothesis:** a candidate function that approximates f.
- **Concept:** a boolean target function, with positive and negative examples for the 1/0 class values.
- **Classifier:** the output of the learning program, used to classify new examples.
- **Learner:** the process that creates the classifier.
- **Hypothesis space:** the set of possible approximations of f that the algorithm can create.
- **Version space:** the subset of the hypothesis space that is consistent with the observed data.
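
To make these terms concrete, here is a small sketch with a toy hypothesis space of threshold rules. The data and the family of hypotheses are assumptions chosen for illustration; real learners search far larger spaces.

```python
# Training examples: (x, f(x)) pairs for an unknown boolean target function f.
training_examples = [(1, 0), (2, 0), (6, 1), (8, 1)]

# Hypothesis space: threshold rules h_t(x) = 1 if x >= t else 0, for t in 0..9.
hypothesis_space = list(range(10))

def predict(threshold, x):
    """Apply the hypothesis with the given threshold to input x."""
    return 1 if x >= threshold else 0

# Version space: the hypotheses that agree with every training example.
version_space = [
    t for t in hypothesis_space
    if all(predict(t, x) == label for x, label in training_examples)
]
print(version_space)  # [3, 4, 5, 6]
```

Every threshold from 3 to 6 explains the observed data equally well, which is why more examples (or extra assumptions) are needed to single out one hypothesis.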

# 5. Commonly used Machine Learning Algorithms
- Linear Regression
- Logistic Regression
- K Nearest Neighbour (KNN)
- K-Means Clustering
- Decision Tree
- Random Forest
- Support Vector Machines (SVM)
- Naïve Bayes

## Linear Regression

Linear regression models the relationship between one or more explanatory (independent) variables and a dependent variable as a linear function. With a single explanatory variable it is called simple linear regression; multiple linear regression makes predictions from several variables. It assumes a linear relationship between the input variables (x) and the output variable (y), so that y is formed as a linear combination of the inputs.

The equation for simple linear regression is y = b0 + b1*x, where the coefficients b0 and b1 are adjusted so that the line gives the best possible predictions, that is, so that the loss is minimized.
This is the familiar equation of a line, y = mx + c, with slope m and intercept c. In machine learning the slope is called the weight and the intercept the bias, so the same equation is often written as y = wx + b.

![https://miro.medium.com/max/720/0*wB_oQO7aqtCUWu3p.png](https://miro.medium.com/max/720/0*wB_oQO7aqtCUWu3p.png)
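
The sketch below fits y = wx + b to a handful of points using the closed-form least-squares solution for a single input variable. The function name `fit_line` and the data are made up for this example, and mean squared error is assumed as the loss.

```python
def fit_line(xs, ys):
    """Return (w, b) minimizing the mean squared error of y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept follows from the means.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1 with no noise
w, b = fit_line(xs, ys)
print(w, b)  # 2.0 1.0
```

With noisy data the recovered weight and bias would only approximate the underlying line.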

## Logistic Regression

- Logistic regression is one of the most popular machine learning algorithms and is a supervised learning technique. It is used to predict a categorical dependent variable from a given set of independent variables.
- Because the dependent variable is categorical, the outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, or True or False. Instead of giving the exact value 0 or 1, however, the model gives a probability that lies between 0 and 1.
- Logistic regression is very similar to linear regression; the difference is in how they are used. Linear regression is used for regression problems, whereas logistic regression is used for classification problems.
- In logistic regression, instead of fitting a straight regression line, we fit an S-shaped logistic function whose output is bounded between 0 and 1.
- The logistic function is `1 / (1 + e^-value)`, where `value` is the linear combination of the input variables.
- The curve of the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
- Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete features.
- Logistic regression can classify observations using different types of data, and makes it easy to determine which variables are most effective for the classification. The image below shows the logistic function:

![https://static.javatpoint.com/tutorial/machine-learning/images/logistic-regression-in-machine-learning.png](https://static.javatpoint.com/tutorial/machine-learning/images/logistic-regression-in-machine-learning.png)
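
A minimal sketch of the logistic function and of how a probability becomes a class label. The 0.5 decision threshold and the helper name `predict_label` are assumptions for illustration; in a fitted model the input `value` would come from the learned linear combination of the features.

```python
import math

def sigmoid(value):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

def predict_label(value, threshold=0.5):
    """Turn the probability into a 0/1 class using a decision threshold."""
    return 1 if sigmoid(value) >= threshold else 0

for value in (-4.0, 0.0, 4.0):
    print(value, round(sigmoid(value), 3), predict_label(value))
```

Large negative inputs give probabilities near 0, large positive inputs give probabilities near 1, and an input of 0 sits exactly at 0.5.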

## KNN
- K-Nearest Neighbours (K-NN) is one of the simplest machine learning algorithms and is based on the supervised learning technique.
- K-NN assumes that similar cases lie close together, and puts a new case into the category whose existing cases it is most similar to.
- K-NN stores all the available data and classifies a new data point based on similarity. When new data appears, it can therefore be easily assigned to a well-suited category.
- K-NN can be used for regression as well as classification, but it is mostly used for classification problems.
- K-NN is a non-parametric algorithm, which means it does not assume a particular form for the underlying data distribution.
- It is also called a lazy learner because it does not learn from the training set immediately; instead it stores the dataset and performs its computation at classification time.
- In other words, in the training phase K-NN just stores the dataset, and when it receives new data it assigns that data to the category of the stored examples most similar to it.

![https://static.javatpoint.com/tutorial/machine-learning/images/k-nearest-neighbor-algorithm-for-machine-learning2.png](https://static.javatpoint.com/tutorial/machine-learning/images/k-nearest-neighbor-algorithm-for-machine-learning2.png)
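
The following from-scratch sketch shows the idea with Euclidean distance and a majority vote. The toy points, the labels, and the choice k=3 are assumptions for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Lazy" learning: no model is fitted; distances are computed at query time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_points, train_labels, (2, 2)))  # A
print(knn_predict(train_points, train_labels, (6, 5)))  # B
```

Because every prediction scans the whole stored dataset, the cost of K-NN is paid at prediction time rather than at training time.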

## K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K is the number of clusters to create, chosen in advance: if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.

The k-means clustering algorithm mainly performs two tasks:

- Determines the best positions for the K center points, or centroids, by an iterative process.
- Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.

Hence each cluster contains data points with some commonalities and is separated from the other clusters.

The diagram below explains the working of the K-means clustering algorithm:

![https://static.javatpoint.com/tutorial/machine-learning/images/k-means-clustering-algorithm-in-machine-learning.png](https://static.javatpoint.com/tutorial/machine-learning/images/k-means-clustering-algorithm-in-machine-learning.png)
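
The two tasks can be sketched as an alternating loop. This is a bare-bones version that assumes fixed starting centroids and a fixed number of iterations; practical implementations usually pick the starting centroids randomly and stop once the assignments no longer change.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means iterations starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            closest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[closest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid          # an empty cluster keeps its centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])  # K = 2
print(centroids)
print(clusters)
```

On this toy data the three points near the origin end up in one cluster and the three points near (8, 8) in the other.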

## Decision Tree

- A decision tree is a supervised learning technique that can be used for both classification and regression problems, although it is mostly preferred for classification. It is a tree-structured model in which internal nodes represent tests on the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome.
- A decision tree has two types of node: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and have no further branches.
- The decisions, or tests, are performed on the basis of the features of the given dataset.
- It is a graphical representation of all the possible solutions to a problem or decision under given conditions.
- It is called a decision tree because, like a tree, it starts from a root node and expands into further branches to form a tree-like structure.
- One common algorithm for building the tree is CART, which stands for Classification and Regression Tree.
- At each node the tree asks a question and, based on the answer (yes/no), splits further into subtrees.
- The diagram below shows the general structure of a decision tree:

![https://static.javatpoint.com/tutorial/machine-learning/images/decision-tree-classification-algorithm.png](https://static.javatpoint.com/tutorial/machine-learning/images/decision-tree-classification-algorithm.png)
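
A full tree is built by repeating one operation: choosing the question that best separates the classes. The sketch below performs a single such split on one numeric feature using Gini impurity, the criterion CART uses for classification. The feature values and labels are hypothetical, and a complete tree builder would apply `best_split` recursively across all features.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted Gini."""
    best_threshold, best_impurity = None, float("inf")
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must send examples down both branches
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

weights = [18, 20, 22, 30, 33, 35]                   # hypothetical feature values
is_obese = ["no", "no", "no", "yes", "yes", "yes"]   # hypothetical labels
print(best_split(weights, is_obese))  # (22, 0.0)
```

The question "is weight <= 22?" separates the two classes perfectly here, so both resulting nodes are pure and would become leaf nodes.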

## Random Forest
Random Forest is a popular supervised learning algorithm that can be used for both classification and regression problems. It is based on the concept of ensemble learning, which is the process of combining multiple models to solve a complex problem and improve overall performance.

As the name suggests, a random forest contains a number of decision trees, each trained on a different subset of the given dataset. Instead of relying on one decision tree, the random forest collects the prediction from each tree and combines them: for classification it predicts the class with the majority of votes, and for regression it averages the trees' outputs.

A greater number of trees generally gives more stable and often more accurate predictions and reduces the risk of overfitting compared with a single tree, although the gains level off as trees are added and more trees do not by themselves guarantee higher accuracy.

The diagram below explains the working of the Random Forest algorithm:

![https://static.javatpoint.com/tutorial/machine-learning/images/random-forest-algorithm.png](https://static.javatpoint.com/tutorial/machine-learning/images/random-forest-algorithm.png)
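
A short sketch using scikit-learn's `RandomForestClassifier` on synthetic data. The dataset, the number of trees, and the seeds are assumptions for illustration. Note that scikit-learn combines trees by averaging their predicted class probabilities rather than by counting hard votes, which usually leads to the same decision.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each of the 50 trees is trained on a bootstrap sample of the training rows.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

print(len(forest.estimators_))        # number of fitted trees
print(forest.score(X_test, y_test))   # accuracy on held-out data
```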

## SVM
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms. It can be used for both classification and regression problems, but it is used primarily for classification.

The goal of the SVM algorithm is to find the best line or decision boundary that separates n-dimensional space into classes, so that new data points can easily be put in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points (vectors) that help in creating the hyperplane. These extreme cases are called support vectors, which is where the name Support Vector Machine comes from. Consider the diagram below, in which two different categories are separated by a decision boundary or hyperplane:

![https://static.javatpoint.com/tutorial/machine-learning/images/support-vector-machine-algorithm.png](https://static.javatpoint.com/tutorial/machine-learning/images/support-vector-machine-algorithm.png)
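
The sketch below fits a linear SVM to a toy, linearly separable dataset with scikit-learn's `SVC` and prints the support vectors it selected. The points are made up for this example, and the default regularization setting is assumed.

```python
from sklearn.svm import SVC

# Two linearly separable groups of 2-D points (toy data).
X = [[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")   # a linear kernel gives a flat hyperplane
clf.fit(X, y)

print(clf.support_vectors_)                    # the points that define the margin
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # classify two new points
```

Only the support vectors influence the position of the boundary; moving any other training point slightly would leave the hyperplane unchanged.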

## Naïve Bayes
- Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
- It is mainly used in text classification, which involves high-dimensional training datasets.
- The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.
- It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to each class.
- Popular applications of Naïve Bayes include spam filtering, sentiment analysis, and classifying articles.

The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as follows:

- **Naïve:** It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features, given the class. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature contributes individually to identifying it as an apple, without depending on the others.
- **Bayes:** It is called Bayes because it depends on the principle of Bayes' theorem.

![https://static.javatpoint.com/tutorial/machine-learning/images/naive-bayes-classifier-algorithm.png](https://static.javatpoint.com/tutorial/machine-learning/images/naive-bayes-classifier-algorithm.png)
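
As an illustration of the text-classification use case, here is a compact from-scratch sketch of a word-count (multinomial) Naïve Bayes classifier. The four training sentences, the labels, and the use of add-one smoothing are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (words, label). Returns the counts needed to predict."""
    class_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    for words, label in documents:
        word_counts[label].update(words)
    vocabulary = {word for words, _ in documents for word in words}
    return class_counts, word_counts, vocabulary

def predict(model, words):
    """Pick the class with the highest log posterior, treating words as independent."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)                    # the "naive" product
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("win free prize now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project status meeting".split(), "ham"),
]
model = train_naive_bayes(training)
print(predict(model, "free prize offer".split()))        # spam
print(predict(model, "monday project meeting".split()))  # ham
```

Working with log probabilities turns the product over features into a sum, which avoids numerical underflow on long documents.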

# 6. Train-Test Splitting

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It can be used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

- **Train Dataset:** Used to fit the machine learning model.
- **Test Dataset:** Used to evaluate the fitted machine learning model.

**The train-test procedure is appropriate when there is a sufficiently large dataset available.**

The dataset should be large enough that the train and test sets are each suitable representations of the problem domain.

A suitable representation of the problem domain means that there are enough records to cover all common cases and most uncommon cases in the domain. This might mean combinations of input variables observed in practice. It might require thousands, hundreds of thousands, or millions of examples.

**In addition to dataset size, another reason to use the train-test split evaluation procedure is computational efficiency.**

The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or the test dataset. For example, a training set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.

There is no optimal split percentage.

You must choose a split percentage that meets your project’s objectives with considerations that include:

- Computational cost in training the model.
- Computational cost in evaluating the model.
- Training set representativeness.
- Test set representativeness.

Nevertheless, common split percentages include:

- Train: 80%, Test: 20%
- Train: 70%, Test: 30%
- Train: 67%, Test: 33%
- Train: 50%, Test: 50%


## Train-Test Split Procedure in Scikit-Learn

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset and check how many rows it contains.
data = pd.read_csv('ABNB_NYC_2019.csv')
len(data)
```

```
48895
```

```python
# Features (X) are the listed columns; the target (y) is the price.
X = data[['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365']].values
y = data['price'].values

X.shape, y.shape
```

```
((48895, 15), (48895,))
```

```python
# Hold out 20% of the rows for testing; rows are shuffled before splitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.20)

X_train[0]
```

```
array([5010368, 'Upper East Side Gem!', 25840204, 'Daniel', 'Manhattan',
       'Upper East Side', 40.77546, -73.95165, 'Entire home/apt', 3, 75,
       '2019-04-27', 1.44, 1, 234], dtype=object)
```

```python
X_train.shape, y_train.shape
```

```
((39116, 15), (39116,))
```

```python
X_test.shape, y_test.shape
```

```
((9779, 15), (9779,))
```

## Repeatable Train-Test Splits
Another important consideration is that rows are assigned to the train and test sets randomly.

This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.

When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset.

This can be achieved by fixing the seed for the pseudo-random number generator used when splitting the dataset.

```python
X_train[0]
```

```
array([32015007, 'Cozy room next to subway!', 240033083, 'Patrik',
       'Manhattan', 'Harlem', 40.82272, -73.95478, 'Private room', 2, 0,
       nan, nan, 1, 0], dtype=object)
```

```python
# Fixing random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.20, random_state=42)

X_train[0]
```

```
array([25674366, 'Mid Century Modern Williamsburg Condo', 9038810,
       'Sanjay', 'Brooklyn', 'Williamsburg', 40.71577, -73.9553,
       'Entire home/apt', 3, 11, '2019-05-16', 0.87, 1, 1], dtype=object)
```

# Cross Validation using K-folds

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

![https://www.researchgate.net/profile/Fabian-Pedregosa/publication/278826818/figure/fig10/AS:614336141750297@1523480558954/The-technique-of-KFold-cross-validation-illustrated-here-for-the-case-K-4-involves.png](https://www.researchgate.net/profile/Fabian-Pedregosa/publication/278826818/figure/fig10/AS:614336141750297@1523480558954/The-technique-of-KFold-cross-validation-illustrated-here-for-the-case-K-4-involves.png)

![https://scikit-learn.org/stable/_images/grid_search_cross_validation.png](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
    1. Take the group as a hold-out or test data set.
    2. Take the remaining groups as a training data set.
    3. Fit a model on the training set and evaluate it on the test set.
    4. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
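
The procedure can be sketched end to end with scikit-learn's `KFold`. The synthetic data, the choice of `LinearRegression` as the model, and k=5 are assumptions for illustration; the score recorded for each fold is the model's R² on the held-out group.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data: y depends linearly on X plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)      # steps 1 and 2
scores = []
for train_idx, test_idx in kf.split(X):                   # step 3
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])                 # fit on the other k-1 groups
    scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the held-out group
    # The fitted model is not kept; only its score is retained.

print(np.mean(scores), np.std(scores))                    # step 4: summarize
```

Reporting the mean together with the standard deviation gives a sense of how much the estimate varies from fold to fold.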

## Configuration of k

The k value must be chosen carefully for your data sample.

A poorly chosen value for k may give a misleading idea of the skill of the model, such as a score with high variance (one that may change a lot depending on the data used to fit the model) or high bias (such as an overestimate of the skill of the model).

Three common tactics for choosing a value for k are as follows:

- Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
- k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.
- k=n: The value for k is fixed to n, where n is the size of the dataset, so that each sample gets an opportunity to be used as the hold-out set. This approach is called leave-one-out cross-validation.
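
The k=n case can be checked on a tiny example: with as many folds as samples, unshuffled `KFold` produces the same splits as scikit-learn's `LeaveOneOut`. The six-row array is made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)   # n = 6 toy samples

k_equals_n = list(KFold(n_splits=len(X)).split(X))
leave_one_out = list(LeaveOneOut().split(X))

print(len(k_equals_n))                         # 6
print([len(test) for _, test in k_equals_n])   # [1, 1, 1, 1, 1, 1]
```

Leave-one-out fits the model n times, so it is usually practical only for small datasets or cheap models.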


## Example

```python
import pandas as pd
from sklearn.model_selection import KFold

data = pd.read_csv('ABNB_NYC_2019.csv')

# shuffle=False keeps the rows in file order, so each test fold is a
# contiguous block of row indices.
kf = KFold(n_splits=10, shuffle=False)

for train, test in kf.split(data):
    # train and test are arrays of row positions for this fold.
    train_data = data.iloc[train]
    test_data = data.iloc[test]
    print(train, test)
    print(train_data.shape, test_data.shape)
    # train model here
    print('\n')
```

```
[ 4890  4891  4892 ... 48892 48893 48894] [   0    1    2 ... 4887 4888 4889]
(44005, 16) (4890, 16)


[    0     1     2 ... 48892 48893 48894] [4890 4891 4892 ... 9777 9778 9779]
(44005, 16) (4890, 16)


[    0     1     2 ... 48892 48893 48894] [ 9780  9781  9782 ... 14667 14668 14669]
(44005, 16) (4890, 16)


[    0     1     2 ... 48892 48893 48894] [14670 14671 14672 ... 19557 19558 19559]
(44005, 16) (4890, 16)


[    0     1     2 ... 48892 48893 48894] [19560 19561 19562 ... 24447 24448 24449]
(44005, 16) (4890, 16)


[    0     1     2 ... 48892 48893 48894] [24450 24451 24452 ... 29336 29337 29338]
(44006, 16) (4889, 16)


[    0     1     2 ... 48892 48893 48894] [29339 29340 29341 ... 34225 34226 34227]
(44006, 16) (4889, 16)


[    0     1     2 ... 48892 48893 48894] [34228 34229 34230 ... 39114 39115 39116]
(44006, 16) (4889, 16)


[    0     1     2 ... 48892 48893 48894] [39117 39118 39119 ... 44003 44004 44005]
(44006, 16) (4889, 16)


[    0     1     2 ... 44003 44
```

# Overfitting vs Underfitting

Understanding model fit is important for understanding the root cause for poor model accuracy. This understanding will guide you to take corrective steps. We can determine whether a predictive model is underfitting or overfitting the training data by looking at the prediction error on the training data and the evaluation data.



![https://docs.aws.amazon.com/machine-learning/latest/dg/images/mlconcepts_image5.png](https://docs.aws.amazon.com/machine-learning/latest/dg/images/mlconcepts_image5.png)

Your model is **underfitting** the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).

Your model is **overfitting** your training data when you see that the model performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
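
One way to see both behaviours is to fit polynomials of increasing degree to the same noisy data and compare the error on the training data with the error on separate evaluation data. Everything in this sketch is an assumption for illustration: the sine-shaped target, the noise level, the sample sizes, and the three degrees.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve on [-1, 1] (synthetic, for illustration)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_eval, y_eval = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with these coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # least-squares polynomial fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_eval, y_eval))
    print(degree, errors[degree])   # (training error, evaluation error)
```

The training error can only go down as the degree grows. A straight line that is too simple for the curve typically shows a high error on both sets (underfitting), while a very flexible polynomial tends to show a low training error with a noticeably higher evaluation error (overfitting); the exact numbers depend on the random draw.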

Poor performance on the training data could be because the model is too simple (the input features are not expressive enough) to describe the target well. Performance can be improved by increasing model flexibility. To increase model flexibility, try the following:

- Add new domain-specific features and more feature Cartesian products, and change the types of feature processing used (e.g., increasing n-grams size)

- Decrease the amount of regularization used

If your model is overfitting the training data, it makes sense to take actions that reduce model flexibility. To reduce model flexibility, try the following:

- Feature selection: consider using fewer feature combinations, decrease n-grams size, and decrease the number of numeric attribute bins.

- Increase the amount of regularization used.

Accuracy on training and test data could be poor because the learning algorithm did not have enough data to learn from. You could improve performance by doing the following:

- Increase the amount of training data examples.

- Increase the number of passes on the existing training data.