### NLP with Machine Learning

In this chapter, we discuss tasks that can be handled with traditional NLP methods, including rules-based techniques and both supervised and unsupervised machine learning.

What is machine learning (ML)?

In simple terms, we are trying to teach machines how to learn like humans.

Definition: ML algorithms are used to enable computers to learn and make decisions from data.

The algorithms fall under two main categories:

- Supervised Learning: Using historical data to predict the future.
  Examples: (Numeric) What will house prices look like for the next 12 months? (Text) How can I flag a suspicious email as spam?
- Unsupervised Learning: Finding patterns and relationships in data.
  Examples: (Numeric) How can I segment my customers? (Text) What hidden themes are in these product reviews?

Common Algorithms:

There are many machine learning algorithms. Once the text data has been preprocessed, we can apply these common ML algorithms to natural language processing:

- Supervised Learning: Regression (Linear, Regularized, Time Series Analysis) and Classification (Logistic Regression, Decision Trees, Random Forest, Gradient Boosted Trees, Naive Bayes)
- Unsupervised Learning: DBSCAN, Hierarchical Clustering, Principal Component Analysis, Non-Negative Matrix Factorization

Traditional NLP:

Common NLP tasks are often solved using traditional NLP methods, such as simple rules-based techniques or more advanced ML algorithms.

NLP Tasks we will be covering:

- Sentiment Analysis: Identifying the positivity or negativity of text (Technique: Rules-based, Library: VADER, Input format: Raw text (because order matters))
- Text Classification: Classifying text as one label or another (Technique: Supervised Learning (Naive Bayes), Library: scikit-learn, Input format: CV/TF-IDF)
- Topic Modeling: Finding themes within a corpus of text (that is, many text documents) (Technique: Unsupervised Learning, Library: scikit-learn, Input format: CV/TF-IDF)
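The "Input format" entries above differ in whether word order survives. The sketch below uses only the standard library as a stand-in for a count-vector (CV) representation; the helper `count_vector` and the two example sentences are hypothetical and are not scikit-learn's implementation. It shows two sentences with different meanings collapsing to the same count vector, which is why a rules-based tool that looks at modifiers such as "not" needs the raw text.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Return the word counts of `text`, ordered by `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Two hypothetical sentences that use the same words in a different order
a = "the lemonade was good not bad"
b = "the lemonade was bad not good"

# Shared vocabulary, sorted so the vector positions are stable
vocabulary = sorted(set(a.split()) | set(b.split()))

vec_a = count_vector(a, vocabulary)
vec_b = count_vector(b, vocabulary)

print(vocabulary)      # ['bad', 'good', 'lemonade', 'not', 'the', 'was']
print(vec_a)           # [1, 1, 1, 1, 1, 1]
print(vec_b)           # [1, 1, 1, 1, 1, 1]
print(vec_a == vec_b)  # True: the word order is lost
```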

Traditional vs Modern NLP:

When should I use traditional/modern NLP techniques?

Note that traditional NLP involves machine learning techniques, and modern NLP involves deep learning techniques. When both are an option, it is best to start simple. The following questions help determine which one to choose.

- What is my NLP goal?
  If my goal is sentiment analysis, text classification, or topic modeling, traditional techniques are enough. If my goal is text generation, machine translation, or question answering, traditional techniques may not be sufficient.
- How much data do I have?
  If I have a small dataset, I can use traditional techniques; if I have a large dataset, modern techniques become an option.

Similar to Chapter 2, before moving forward, we will create a new environment called 'nlp_machine_learning' and install the following packages:

- jupyter
- matplotlib
- notebook
- numpy
- openpyxl
- pandas
- python
- scikit-learn
- spacy

We also want to install the `vaderSentiment` package. The usual install command fails because the package is not available in the default Anaconda channel. It is available in the community-maintained 'conda-forge' channel, so we install it from there instead.
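Once the environment is set up, a quick check like the following can confirm that the packages are importable. This is a minimal sketch; the helper name `is_installed` is hypothetical. Note that the import name can differ from the package name (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

def is_installed(module_name):
    """Return True if `module_name` can be imported in the current environment."""
    return find_spec(module_name) is not None

# Import names, not package names: scikit-learn is imported as `sklearn`
for name in ["vaderSentiment", "sklearn", "spacy"]:
    print(name, "installed" if is_installed(name) else "missing")
```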

#### Sentiment Analysis

Sentiment analysis is used to determine the positivity or negativity of text. Each block of text is given a score between -1 (most negative) and +1 (most positive).

Note: This will be applied to raw text.

This can be done with libraries such as `VADER`, classification techniques, or modern NLP techniques. Here we will be using `VADER` (Valence Aware Dictionary and sEntiment Reasoner). This works well with informal text (social media text, online reviews).

Note: Steps with different libraries are mostly similar.

Step 1: Import `SentimentIntensityAnalyzer`.

Step 2: Identify the corresponding text.

Step 3: Create a new `SentimentIntensityAnalyzer` object.

Step 4: Obtain polarity scores.

In the output, the most important value is the `compound` score, which summarizes the overall positivity or negativity of the text. It is calculated using a series of rules. First, VADER assigns predefined sentiment weights to words (amazing = 2.8, horrible = -2.5). Then it adjusts those weights for modifiers (not, very, capitalization, punctuation, ...) and combines them into a final score.
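To see how a word weight ends up between -1 and +1, here is a minimal sketch of the final normalization step. It assumes the formula used in VADER's reference implementation: the raw score divided by the square root of its square plus a constant of 15. The function name `normalize` and the single-weight inputs are illustrative only; the real analyzer first sums and adjusts the weights of every word in the text.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash a raw valence sum into the range [-1, 1].

    alpha=15 is assumed to match the default constant in VADER's
    reference implementation.
    """
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    return max(-1.0, min(1.0, norm))

# A text whose only sentiment-bearing word has weight 2.8 (e.g. "amazing")
print(round(normalize(2.8), 4))  # 0.5859

# A text whose only sentiment-bearing word has weight 2.0 (hypothetical)
print(round(normalize(2.0), 4))  # 0.4588

# Larger raw sums approach, but never reach, 1
print(normalize(20.0) < 1.0)     # True
```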

In [1]:
import pandas as pd

# create a list of sentences
data = [
    "When life gives you lemons, make lemonade! 🙂",
    "She bought 2 lemons for $1 at Maven Market.",
    "A dozen lemons will make a gallon of lemonade. [AllRecipes]",
    "lemon, lemon, lemons, lemon, lemon, lemons",
    "He's running to the market to get a lemon — there's a great sale today.",
    "iced tea is my favorite",
    "I didn't like the taste of that lemonade at all.",
    "My lemons went bad before I could use them, unfortunately.",
]

# expand the column width to see the full sentences
pd.set_option('display.max_colwidth', None)

# turn it into a dataframe
data_df = pd.DataFrame(data, columns=["sentence"])
data_df.head()

# make a copy of the dataframe
df = data_df.copy()
df.head()

| | sentence |
|---|---|
| 0 | When life gives you lemons, make lemonade! 🙂 |
| 1 | She bought 2 lemons for $1 at Maven Market. |
| 2 | A dozen lemons will make a gallon of lemonade. [AllRecipes] |
| 3 | lemon, lemon, lemons, lemon, lemon, lemons |
| 4 | He's running to the market to get a lemon — there's a great sale today. |


In [2]:
### Import the VADER library

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [4]:
### First, we will test the code with the first sentence.

test = df['sentence'][0]
test

'When life gives you lemons, make lemonade! 🙂'

In [6]:
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(test)

{'neg': 0.0, 'neu': 0.75, 'pos': 0.25, 'compound': 0.4587}

The output above tells us what proportion of the sentence is negative, neutral, and positive (`neg`, `neu`, `pos`), along with the final sentiment score (`compound`).

In [7]:
analyzer.polarity_scores(test)['compound']

0.4587

In [8]:
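### Now we turn it into a function and apply it to the entire column.

# create the analyzer once so it is not rebuilt for every row
analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text):
    # return only the compound score for a piece of text
    return analyzer.polarity_scores(text)['compound']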

In [9]:
df['sentence'].apply(get_sentiment)

0    0.4587
1    0.0000
2    0.0000
3    0.0000
4    0.6249
5    0.4588
6   -0.2755
7   -0.7096
Name: sentence, dtype: float64

In [10]:
df['sentiment'] = df['sentence'].apply(get_sentiment)
df

| | sentence | sentiment |
|---|---|---|
| 0 | When life gives you lemons, make lemonade! 🙂 | 0.4587 |
| 1 | She bought 2 lemons for $1 at Maven Market. | 0.0 |
| 2 | A dozen lemons will make a gallon of lemonade. [AllRecipes] | 0.0 |
| 3 | lemon, lemon, lemons, lemon, lemon, lemons | 0.0 |
| 4 | He's running to the market to get a lemon — there's a great sale today. | 0.6249 |
| 5 | iced tea is my favorite | 0.4588 |
| 6 | I didn't like the taste of that lemonade at all. | -0.2755 |
| 7 | My lemons went bad before I could use them, unfortunately. | -0.7096 |
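A compound score is often easier to use once it is turned into a label. The sketch below assumes the commonly cited VADER convention of treating scores at or above 0.05 as positive and at or below -0.05 as negative, with everything in between neutral. The function name `label_sentiment` is hypothetical, and the scores are copied from the output above so the example runs without any extra packages.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to 'positive', 'negative', or 'neutral'.

    threshold=0.05 is an assumed cutoff; adjust it to suit the data.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the sentiment column above
scores = [0.4587, 0.0, 0.0, 0.0, 0.6249, 0.4588, -0.2755, -0.7096]
labels = [label_sentiment(score) for score in scores]
print(labels)
# ['positive', 'neutral', 'neutral', 'neutral',
#  'positive', 'positive', 'negative', 'negative']
```

With the DataFrame from this section, the same function could be applied as `df['sentiment'].apply(label_sentiment)`.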
