# What is machine learning?


Machine learning means giving a machine the ability to learn patterns from data and then use those patterns to recognize them later in unseen data.

**NLP is not only machine learning, and vice versa!**
![](https://lh3.googleusercontent.com/RQ1bMI_oo1K1ikB45ufBU72BCLhvCoDKww0xgDi45A93Ad9iT39m48qf9_7WlWDnrUrzF0wh_SQp_vBlnTzRYqgGpJXiL4YOY5ZmGyiTQfA_hWTtqIWxjFGrkGywz8cAPNbVxO1J)

## Machine learning algorithms:

- Rule based machine learning
- Statistical machine learning






From the perspective of data:

- Supervised machine learning algorithms

- Unsupervised machine learning algorithms


NLP tasks which benefit from machine learning algorithms:

- Sentiment Analysis
- Machine Translation
- Text Classification
- Information retrieval
- ...

Most NLP tasks nowadays benefit from machine learning approaches to improve the quality of their results.

## Supervised machine learning algorithms:

A supervised machine learning algorithm learns patterns from data for which we already know the output.

One general type of supervised machine learning is called classification, in which an already labeled dataset is used to learn a generalized pattern; the algorithm then uses this pattern to assign new items that it hasn't seen before to the classes.

**Classification** is when the output of the algorithm is discrete.


- Example: Spam detection (input: an email, output: spam or not spam)

**Regression** is another general type of supervised machine learning. Unlike in classification, the range of the output in regression is continuous.

So the output is not limited to a fixed set of labels.

- Example: Price estimation (input: a product, output: a price between 0\$ and n\$)
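
The difference between the two output types can be sketched with a toy example. This is only an illustration, assuming scikit-learn is installed; the data, the variable names, and the choice of `LogisticRegression` and `LinearRegression` are assumptions made for this sketch and are not used elsewhere in this session.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy, made-up data: one numeric feature per item.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]

# Classification: the targets are discrete labels.
y_labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]
classifier = LogisticRegression().fit(X, y_labels)
predicted_label = classifier.predict([[5.5]])[0]

# Regression: the targets are continuous numbers.
y_prices = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
regressor = LinearRegression().fit(X, y_prices)
predicted_price = regressor.predict([[5.5]])[0]

print(predicted_label)  # always one of the labels seen during training
print(predicted_price)  # a real number, not restricted to the training values
```

The classifier can only ever answer with one of the two labels, while the regressor can return any value on the number line.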



### Examples of supervised machine learning algorithms:

- Naive Bayes classifier
- Support vector machines
- Linear regression
- Neural networks
    


## Unsupervised machine learning algorithms:

There is no labeled dataset. The algorithm learns patterns based on the similarity of the data. One of the general types of unsupervised machine learning is called clustering.

Example: Topic Modeling (we will see this later)
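
Clustering can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `KMeans`; the points and the choice of two clusters are made up for the sketch, and topic modeling itself works differently.

```python
from sklearn.cluster import KMeans

# Toy, made-up 2D points forming two well-separated groups; no labels are given.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]

# Ask for two clusters; n_init and random_state are fixed for reproducibility.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# The cluster ids are arbitrary (0 or 1); only the grouping is meaningful.
cluster_ids = kmeans.labels_
print(cluster_ids)
```

The algorithm never sees a label; it groups the points only by how close they are to each other.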


-----

In this session we focus on the general framework of a supervised machine learning algorithm and use one of the mentioned algorithms for a text-classification task. 

## Train - Test - Validation sets:

To evaluate a machine learning algorithm, a supervised machine learning framework usually splits a labeled dataset into three sets.


The **train** set is used by the algorithm to learn and generalize patterns from data.

The **validation** set is used to evaluate the model while it is being trained. (It may or may not be used in a framework.)

The **test** set is used to evaluate the model after training is done and the model no longer changes.
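
One way to produce the three sets is to call scikit-learn's `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses made-up placeholder data; the proportions and variable names are illustrative choices, not a fixed rule.

```python
from sklearn.model_selection import train_test_split

# Toy, made-up labeled data: 100 items with alternating labels.
texts = [f"document {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First hold out 20% of the data as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# ... then hold out 25% of the remainder (20% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```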


## Preprocessing


Data is messy!

We usually need to apply some preprocessing to the data before learning patterns from it.

![](https://i.pinimg.com/564x/db/4f/88/db4f88f155d22599f59765e14f4c5497.jpg)


**Research and Reflect:** Do you remember what preprocessing methods we already talked about?  

### write your answer here

---


## Feature extraction

The feature extraction step produces feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.

Remember the one-hot encoding representation?



A simple but effective and widely used feature representation method in NLP is called **Bag of Words**.

In this method the input text is looked at without considering the order in which the tokens occur; only the frequency of occurrence of each token is counted.


![](https://miro.medium.com/max/764/1*MeSYCKGDOdwkJKVZKxJuvg.png)
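
The idea can be sketched in plain Python. The helper `bag_of_words` below is a hypothetical name introduced for this illustration; it assumes simple lowercasing and whitespace tokenization, which is cruder than a real tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())

# Same tokens, different order: the two bags are identical.
bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")

print(bag_a)
print(bag_a == bag_b)  # True
```

The two sentences mean different things, yet their bags are equal, which is exactly the information this representation throws away.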

**Code it:** We have already seen an nltk function that takes a text and returns its bag-of-words representation. Use it to extract the bag-of-words representation of [this story by Edgar Allan Poe](https://www.gutenberg.org/files/1064/1064-0.txt).



In [None]:
#insert your code here

----

Let's get our hands on training and testing a simple supervised machine learning text classification algorithm based on bag-of-words.

Download the data set of consumer complaints from the following link.
https://www.kaggle.com/dushyantv/consumer_complaints

Before loading the dataset into memory, install the required libraries if they are not yet installed:

In [None]:
# if not yet installed install libraries:

!pip install numpy
!pip install pandas
!pip install matplotlib

We reduce the data to the product and the text of the consumer complaint.

Then we add a column called `category_id`, which maps each product category to a number.
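
Before running the full cell, it may help to see what pandas' `factorize` does on its own. The toy data frame below is made up for illustration and only stands in for the real `Product` column.

```python
import pandas as pd

# Toy, made-up product names standing in for the real 'Product' column.
toy = pd.DataFrame(
    {"Product": ["Mortgage", "Credit card", "Mortgage", "Student loan"]})

# factorize() returns (codes, uniques); the codes number the categories
# in order of first appearance.
codes, uniques = toy["Product"].factorize()
toy["category_id"] = codes

print(toy)
print(list(uniques))  # ['Mortgage', 'Credit card', 'Student loan']
```

Rows with the same product share the same `category_id`, and `uniques` lets you map an id back to its product name.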

In [None]:
import numpy as np   # numerical arrays and random sampling
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

df = pd.read_csv('Consumer_Complaints.csv')

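# Drop rows at random to avoid a memory error.
# The indices are sampled with replacement, so duplicates are possible
# and fewer than remove_n distinct rows may actually be dropped.
np.random.seed(10)
remove_n = 1000000
drop_indices = np.random.choice(df.index, remove_n, replace=True)
df = df.drop(drop_indices)
df.shape
# Remove the block above once memory use has been optimized.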

# We need the 'Product' (output) and 'Consumer Complaint' (input) columns.
col = ['Product', 'Consumer Complaint']
df = df[col]
df = df[pd.notnull(df['Consumer Complaint'])]
df.columns = ['Product', 'Consumer_Complaint']
df['category_id'] = df['Product'].factorize()[0]
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)

df = df[0:5000]
df.to_csv("Complaints.csv")
display(df)
df.head()



Running the following code shows the number of consumer complaints for each product.


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Complaints.csv")
fig = plt.figure(figsize=(8,6))
df.groupby('Product').Consumer_Complaint.count().plot.bar(ylim=0)
plt.show()

# The classes are imbalanced.
# We want a classifier that gives high prediction accuracy on the majority classes,
# which may be the ones of most interest, while maintaining reasonable accuracy on the minority classes.

### TF-IDF

Instead of using the raw count of a word in a document, we use another feature based on token frequencies, called TF-IDF.
Term Frequency-Inverse Document Frequency (TF-IDF) weighs how often a token occurs in a document against how many documents in the collection contain it. It is intended to measure how important a token or word is to a document, and it is computed by:

![](http://www.digitalmarketingchef.org/wp-content/uploads/2020/02/TFIDF_FORMULA.png)
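
To make the computation concrete, here is a minimal sketch of one common variant: term frequency times the natural log of the inverse document frequency. The function `tf_idf`, the toy corpus, and the choice of variant are assumptions for this illustration; the formula in the image may differ in details such as the log base, and sklearn's `TfidfVectorizer` adds smoothing and normalization, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy, made-up corpus of three tokenized "documents".
documents = [
    ["loan", "payment", "late", "loan"],
    ["card", "payment", "fee"],
    ["card", "fraud", "report"],
]

def tf_idf(token, document, corpus):
    """Term frequency in `document` times log inverse document frequency.

    Assumes `token` occurs in at least one document of `corpus`.
    """
    tf = Counter(document)[token] / len(document)
    containing = sum(1 for doc in corpus if token in doc)
    idf = math.log(len(corpus) / containing)
    return tf * idf

# Compare two tokens of the first document.
for token in ["loan", "payment"]:
    print(token, round(tf_idf(token, documents[0], documents), 3))
```

Try changing how many documents contain a token and watch how its score moves.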

**Reflect and Reply:** Considering the formula above for the tf-idf measure, do you expect the tf-idf of stopwords to be relatively high or low compared to other words in a document?

Why?

### write your answer here

---

The following code uses a library called sklearn to extract the tf-idf of words from text and turn it into numeric features.

The code prints the dimensions of the feature vector.

**Observe and Reflect:** Can you explain the dimensions of the output? What does each dimension of the output vector represent?

### write your answer here


In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Consumer_Complaint).toarray()
labels = df.category_id
features.shape

features

In [None]:
!pip install seaborn

The following code uses a supervised machine learning algorithm called [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) (support vector machine), here in its linear variant `LinearSVC`, to train a model that identifies the product category from a consumer complaint given as input.

We don't go into further detail of the algorithm.

1- The code uses the tf-idf features extracted from the data in the previous step.

2- It then splits the data into a train set and a test set.

3- It trains a model on the train set.

4- It uses the trained model to identify the category of the test samples.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.Product.values, yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

**Reflect and Reply:** Can you map parts of the code to each of the steps 1, 2, 3 and 4 of the supervised machine learning framework above? (For instance, which line of code does step 1: using the tf-idf features extracted from the data in the previous step?)

### write your answer here

**Reflect and Reply:** According to the heat map, which categories were better identified by the above machine learning algorithm?

### write your answer here

-----


**Homework exercise:**
How do we evaluate supervised machine learning algorithms? Research measures such as **Precision**, **Recall** and **Accuracy**.

Explain the difference between them (you may refer to the sklearn documentation):

https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

**Code it:**
Evaluate the above method for product categorization based on consumer complaints, using the precision, recall and accuracy measures from the sklearn library.


### write your answer here

In [None]:
#insert your code here

### References

https://www.kaggle.com/anucool007/multi-class-text-classification-bag-of-words