# Machine learning I
## What is Machine learning?
### Some definitions
* "Field of study that gives computers the ability to learn without being explicitly programmed" Arthur Samuel (1959)

<img src="https://upload.wikimedia.org/wikipedia/commons/3/30/International_draughts.jpg" alt="checkers" width="200px"/>

* "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." Tom Mitchell (1998)

### Categories of Machine learning

![title](https://i.vas3k.ru/7w1.jpg)


#### Supervised learning

![alt](https://raw.githubusercontent.com/gesiscss/WDCNLP/main/data/spam-filters.png)

Supervised learning refers to modeling the relationship between measured features of data and some label associated with the data; once this model is determined, it can be used to apply labels to new, unknown data. Further subcategories are classification tasks and regression tasks: in classification, the labels are discrete categories, while in regression, the labels are continuous quantities.
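The difference between the two subcategories can be sketched with a few invented data points. The toy values below and the choice of nearest-neighbor models (`KNeighborsClassifier`, `KNeighborsRegressor`) are assumptions made for illustration only; any other scikit-learn estimator would follow the same `fit` / `predict` pattern.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy data (invented for illustration): one feature per sample.
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]

# Classification: the labels are discrete categories (0 = "small", 1 = "large").
y_class = [0, 0, 0, 1, 1, 1]
clf_toy = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
class_prediction = clf_toy.predict([[2.5], [10.5]])

# Regression: the labels are continuous quantities.
y_reg = [1.1, 1.9, 3.2, 9.8, 11.1, 12.3]
reg_toy = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
reg_prediction = reg_toy.predict([[2.5], [10.5]])

print(class_prediction)  # one category per query point
print(reg_prediction)    # one real number per query point
```

The classifier returns one of the known categories, while the regressor returns the average of the neighboring target values.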

#### Unsupervised learning

<img src="https://raw.githubusercontent.com/gesiscss/WDCNLP/main/data/network.png" alt="network" width="200px"/>

Example: network clustering


Unsupervised learning refers to modeling the features of a dataset without reference to any label. These models include tasks such as clustering or dimensionality reduction. Clustering algorithms identify distinct groups of data, while dimensionality reduction algorithms search for more succinct representations of the data.

* Other categories include semi-supervised learning, reinforcement learning, recommender systems, ...
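As a rough sketch of the two unsupervised tasks named above, the following uses synthetic data and two scikit-learn estimators, `KMeans` for clustering and `PCA` for dimensionality reduction. Both the data and the choice of these two algorithms are assumptions for illustration; note that no labels are passed to either model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two well-separated groups of points in 3 dimensions (synthetic data).
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([group_a, group_b])

# Clustering: assign each point to one of two groups, without any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: describe each point with 2 numbers instead of 3.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(cluster_ids[:5], cluster_ids[-5:])
print(X.shape, "->", X_2d.shape)
```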

## Supervised learning (Classification)
We will first take a look at a simple classification task, in which you are given a set of labeled points and want to use these to classify some unlabeled points.

$$\mathbf{Data} = \begin{bmatrix}
    \textbf{feature 1} & \textbf{feature 2} & \textbf{label}  \\
    x_{1}^{(1)} & x_{2}^{(1)} & red \\
    x_{1}^{(2)} & x_{2}^{(2)} & red \\
        \vdots & \vdots & \vdots \\
    x_{1}^{(150)} & x_{2}^{(150)} & blue
\end{bmatrix}.
$$

![alt](https://jakevdp.github.io/PythonDataScienceHandbook/figures/05.01-classification-1.png)

This data set represents points in a 2-dimensional plane. Furthermore, each point is associated with one of two possible class labels ("red" or "blue").

We will create a model that lets us decide to which of the two classes a point belongs, assuming that the two classes can be separated by drawing a straight line between them.

Here we have two-dimensional data: that is, we have two features for each point, represented by the (x,y) positions of the points on the plane. In addition, we have one of two class labels for each point, here represented by the colors of the points. From these features and labels, we would like to create a model that will let us decide whether a new point should be labeled "blue" or "red."

![alt](https://jakevdp.github.io/PythonDataScienceHandbook/figures/05.01-classification-2.png)

With this model we can now *predict* the classes of new unseen data.

![alt](https://jakevdp.github.io/PythonDataScienceHandbook/figures/05.01-classification-3.png)
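The figures above can be approximated with a short sketch. The text does not say which model produced the separating line, so the linear support vector classifier (`SVC(kernel="linear")`) and the synthetic points from `make_blobs` used here are assumptions, chosen only because they yield a straight-line boundary.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two synthetic groups of 2-dimensional points (label 0 and label 1).
X, y = make_blobs(
    n_samples=150, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)

# A linear model separates the classes with the line w1*x + w2*y + b = 0.
model = SVC(kernel="linear").fit(X, y)
w1, w2 = model.coef_[0]
b = model.intercept_[0]
print("line: %.2f * x + %.2f * y + %.2f = 0" % (w1, w2, b))

# Predict the class of new, unseen points from the side of the line they fall on.
new_points = [[-2.5, -3.5], [4.0, 2.0]]
predictions = model.predict(new_points)
print(predictions)
```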




## Introducing Scikit-Learn

To make things less abstract, let us begin with the Iris data set, a classic example dataset from statistics ([see](https://en.wikipedia.org/wiki/Iris_flower_data_set)).

Iris Setosa | Iris Versicolor  | Iris Virginica
- | -  | - 
![alt](http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg) | ![alt](http://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Iris_versicolor_3.jpg/1920px-Iris_versicolor_3.jpg) | ![alt](http://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Iris_virginica.jpg/1920px-Iris_virginica.jpg)

We will represent each individual flower sample as one row in our DataFrame, and the columns (features) represent the flower measurements in centimeters. We can represent the Iris dataset, consisting of 150 samples and 4 features, as a 2-dimensional array or matrix in $\mathbb{R}^{150 \times 4}$ in the following format:


$$\mathbf{Data} = \begin{bmatrix}
    \textbf{feature 1} & \textbf{feature 2} & \textbf{feature 3} & \dots  & \textbf{label} \\
    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & \dots  & y^{(1)} \\
    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & \dots  & y^{(2)} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{1}^{(150)} & x_{2}^{(150)} & x_{3}^{(150)} & \dots  & y^{(150)}
\end{bmatrix}.
$$
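To make the matrix layout concrete, here is a minimal sketch that uses the copy of Iris bundled with scikit-learn (`load_iris`). This is an assumption for illustration: the notebook itself loads a CSV file, whose column names may differ from the bundled version.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix: one row per flower, one column per measurement
y = iris.target  # label vector: one integer class label per flower

print(X.shape)             # (150, 4)
print(y.shape)             # (150,)
print(iris.feature_names)  # names of the four feature columns
print(X[0], y[0])          # first sample and its label
```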

### Features


In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/gesiscss/WDCNLP/main/data/iris.csv", na_values="?")
df.head()

Let us first inspect the distribution of our class label:

In [None]:
df.species.value_counts()

Next, we inspect the dimensionality of the *DataFrame*:

In [None]:
df.shape

Finally, we visualize the classes with respect to the two features *"sepal_length"* and *"sepal_width"*:

In [None]:
# Work on a copy, add one color per species and plot the two sepal features
df2 = df.copy()
df2["color"] = df2.species.replace(["versicolor", "virginica", "setosa"], ["red", "blue", "green"])
df2.plot.scatter("sepal_length", "sepal_width", c=df2["color"])

### Preprocessing the data

scikit-learn is built on NumPy arrays rather than on the [Pandas DataFrame](https://www.geeksforgeeks.org/python-pandas-dataframe/). NumPy is also the package that underlies Pandas, and we can easily get these arrays from a DataFrame.

In [None]:
val = df.values  # NumPy array holding all rows of the DataFrame
print(type(val))
print(val[1:5])  # rows 1 to 4
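The conversion can be tried in isolation on a tiny frame. The two-row DataFrame below is invented for this sketch; `to_numpy()` is the method form of the `.values` attribute used above.

```python
import pandas as pd

# A tiny hypothetical frame with one numeric and one string column.
frame = pd.DataFrame(
    {"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]}
)

# Converting mixed columns gives a generic "object" array ...
mixed = frame.to_numpy()
# ... while purely numeric columns give a float array.
numeric = frame[["sepal_length"]].to_numpy()

print(mixed.dtype)
print(numeric.dtype, numeric.shape)
```

This is one reason to separate the numeric features from the string label before handing the data to a model.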

For classification, we need separate arrays for the descriptive features and the class label:

In [None]:
labels = df['species'].values
labels[1:5]

We will work with integer class labels, so we need to transform the strings into integers. (scikit-learn classifiers also accept string labels directly, but integer codes keep the later evaluation code simple.) Fortunately, there are convenience functions available to do this.

In [None]:
from sklearn import preprocessing
labels = preprocessing.LabelEncoder().fit_transform(labels)
labels[1:5]
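To see which integer stands for which class, the encoder object can be kept around instead of being discarded. The short list of species below is a made-up sample for this sketch; `classes_` and `inverse_transform` are part of `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# A small hypothetical sample of class names.
species = ["setosa", "versicolor", "virginica", "setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

print(encoder.classes_)  # class names in sorted order; the position is the integer code
print(encoded)           # [0 1 2 0]
print(encoder.inverse_transform([2, 0]))  # map integer codes back to names
```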

In [None]:
features = df.drop("species", axis=1).values
features[1:5]

### Training the Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier()
clf.fit(features, labels)

In [None]:
predicted = clf.predict(features)
predicted[1:5]
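With its default settings, `KNeighborsClassifier` classifies a point by a majority vote among its 5 nearest training points. The idea can be sketched in a few lines of plain Python. This is a simplified illustration with invented data and a hypothetical helper name (`knn_predict`), not scikit-learn's actual implementation, which is more efficient and may break ties differently.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Return the most common label among the k training points nearest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented toy data: two groups of points with labels 0 and 1.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, toy_labels, (2, 2), k=3))  # 0
print(knn_predict(points, toy_labels, (7, 8), k=3))  # 1
```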

### Evaluation
To evaluate our model, we test which instances have been predicted correctly:

In [None]:
labels == predicted

The number of data points which are correctly predicted is

In [None]:
sum(labels==predicted)

In [None]:
sum(labels == predicted) / len(labels)  # fraction of correct predictions (accuracy)

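### Splitting the data into training and test sets

So far we have evaluated the classifier on the same data it was trained on, which gives an overly optimistic estimate. To measure how well the model generalizes to unseen data, we split the data into a training set and a test set: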

In [None]:
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.33, random_state=42)

In [None]:
print(len(features_train))
print(len(features_test))
print(len(labels_train))
print(len(labels_test))
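The effect of `test_size` and `random_state` can be checked on placeholder data of the same length. The arrays below are made up for this sketch and only mimic the 150 rows of the Iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 150 rows, standing in for the features and labels.
X = np.arange(300).reshape(150, 2)
y = np.arange(150) % 3

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(len(X_train), len(X_test))  # 100 50

# The same random_state reproduces exactly the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.33, random_state=42
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)  # True
```

Fixing `random_state` makes the results of a notebook reproducible from run to run.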

We then fit the classifier on the training data:

In [None]:
clf = KNeighborsClassifier()
clf.fit(features_train, labels_train)

and use it to predict the labels of the test data:

In [None]:
predicted_test = clf.predict(features_test)
predicted_test

The accuracy on the test set is then

In [None]:
sum(labels_test == predicted_test) / len(labels_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(labels_test, predicted_test)


We can extend this equivalently to other [evaluation measures](https://en.wikipedia.org/wiki/Precision_and_recall):

In [None]:
from sklearn.metrics import precision_recall_fscore_support
precision, recall, fscore, support = precision_recall_fscore_support(labels_test, predicted_test, labels=[1])
print("Precision: ", precision)  # of the instances predicted as label 1, the fraction that really has label 1
print("Recall: ", recall)  # of the instances that really have label 1, the fraction predicted as label 1
print("F_score: ", fscore)  # harmonic mean of precision and recall
print("support: ", support)  # number of instances with this label in labels_test
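The four numbers can also be computed by hand, which makes their definitions explicit. The two short label lists below are invented for this sketch and are not taken from the Iris results.

```python
# Hypothetical true and predicted labels for eight instances.
true_labels = [1, 1, 1, 0, 0, 2, 2, 1]
pred_labels = [1, 0, 1, 1, 0, 2, 1, 1]
target = 1  # the label we evaluate

pairs = list(zip(true_labels, pred_labels))
tp = sum(t == target and p == target for t, p in pairs)  # correctly predicted as 1
fp = sum(t != target and p == target for t, p in pairs)  # wrongly predicted as 1
fn = sum(t == target and p != target for t, p in pairs)  # label 1 that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
support = true_labels.count(target)

print(precision)  # 0.6
print(recall)     # 0.75
print(f_score)
print(support)    # 4
```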

### K-fold cross validation
We will now randomly partition the data into k equal-sized subsamples. We then retain a single subsample for testing our model and use the remaining k − 1 subsamples as training data. This is repeated k times, so that each subsample is used exactly once for testing.



In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(features):
    # Select the training and test rows of the current fold
    features_train = features[train_index]
    labels_train = labels[train_index]
    features_test = features[test_index]
    labels_test = labels[test_index]

    # Fit on the training folds and report the accuracy on the held-out fold
    clf.fit(features_train, labels_train)
    predicted_test = clf.predict(features_test)
    print(sum(labels_test == predicted_test) / len(labels_test))

scikit-learn provides a shortcut for this:

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, features, labels, cv=20, scoring="accuracy")
print(scores)
print("Accuracy according to cross-validation: ", scores.mean())
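To see how the partitioning works, the fold indices can be printed for a very small example. The six placeholder samples are invented for this sketch, and shuffling is left off (unlike in the loop above) so that the fold layout is easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples (hypothetical data), one feature each.
X = np.arange(6).reshape(6, 1)

# Without shuffling, the folds are consecutive blocks of indices.
kf_demo = KFold(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in kf_demo.split(X)]

for train, test in folds:
    print("train:", train, "test:", test)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Every sample appears in exactly one test fold, so each data point is used once for testing and k − 1 times for training.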

## Further learning resources
* VanderPlas, Jake. *Python Data Science Handbook: Essential Tools for Working with Data*. O'Reilly Media, Inc., 2016.
* Gramfort, Alex and Mueller, Andreas. *SciPy 2017 scikit-learn tutorial*. SciPy, 2017.
* Guido, Sarah and Mueller, Andreas. *Introduction to Machine Learning with Python: A Guide for Data Scientists* ([link](https://github.com/amueller/introduction_to_ml_with_python)).

#### Used resources
* Wikipedia.org
* Python Data Science Handbook
* Kaggle
* Google
* A review on machine learning: trends and future prospects by Manish Kumar Aery and Chet Ram