<a href='https://www.moh.gov.sa/en/Pages/Default.aspx'> <img style="float: left;height:70px" src="http://scienceacademy.ca/wp-content/uploads/2018/12/Logo_SA.png"></a>

Instructor and author: [_Dr. Junaid Qazi_](https://www.linkedin.com/in/jqazi/)

<img src="https://snag.gy/uvESGH.jpg" alt="drawing" width="400"/>

$$
\begin{eqnarray*}
\textbf{Fun Fact:  } \text{Word Clouds} &\neq& \text{Data Science}
\end{eqnarray*}
$$

[Free Word Cloud Generator](https://www.wordclouds.com)

# NLP Lab

### Objectives
* Extract features from unstructured text.
* Describe how `CountVectorizer` and `TfidfVectorizer` work.
* Implement `CountVectorizer` in a spam classification model.
* Tune the vectorizer's hyperparameters with `Pipeline` and `GridSearchCV`.

In [3]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Import CountVectorizer and TfidfVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Render matplotlib figures at retina resolution for sharper images.
# (IPython.display.set_matplotlib_formats is deprecated in recent IPython releases.)
%config InlineBackend.figure_format = 'retina'

# Just don't want warnings.
import warnings
warnings.filterwarnings('ignore')


## Introduction to Text Feature Extraction

The models we've learned, like linear regression, logistic regression, and k-nearest neighbors, take in an `X` and a `y` variable.
- `X` is a matrix/dataframe of real numbers.
- `y` is a vector/series of real numbers.

Text data (also called natural language data) is not already organized as a matrix or vector of real numbers. We say that this data is **unstructured**. Before we can model it, we have to transform our unstructured text data into a numeric `X` matrix!

## Spam Classification Model

One of the most common applications of NLP is classifying messages as `"spam"` or `"ham"` (that is, `"spam"` vs. `"not spam"`).

***What do you think: can we tell `real` texts from `promotional` ones based only on what is written?***

> This data set was taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [68]:
# Read in data.
data_url="""https://raw.githubusercontent.com/junaidqazi/DataSets_Practice\
_ScienceAcademy/master/SMS_Spam_Collection/SMSSpamCollection"""

df = pd.read_csv(data_url,sep='\t',names=['label', 'message'])

In [69]:
# Check out first five rows.
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Basic terminology

---

Virtually all NLP uses this base terminology *(Recall the lecture)*:

- A collection of text is a **document**. 
    - You can think of a document as a row in your feature matrix.
- A collection of documents is a **corpus**. 
    - You can think of your full dataframe as the corpus.

<details><summary>In this specific example, what is a document, and how many documents do you have in the corpus?</summary>
    
- Each text message in our data set is one document. 
- There are 5,572 documents in our corpus.
</details>

## Model prep
---

Convert the `ham`/`spam` labels into binary labels. *Try using `map` with a dictionary (`ham: 0`, `spam: 1`), or any other approach you prefer.*
- 0 for ham
- 1 for spam
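As a hedged illustration of the `map` hint, here is a minimal sketch on a made-up three-row frame (the `toy` DataFrame and its messages are assumptions for illustration, not the lab's data):

```python
import pandas as pd

# Toy frame standing in for the real `df`; the messages are made up.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["see you soon", "win a free prize", "ok thanks"],
})

# Map each string label to its binary code.
# A label missing from the dictionary would become NaN, which makes typos easy to spot.
toy["label"] = toy["label"].map({"ham": 0, "spam": 1})
print(toy["label"].tolist())
```

Compared with `apply` and a lambda, the dictionary states the full mapping in one place, and unexpected labels do not silently fall into the `else` branch.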

In [70]:
# code here please
df['label'] = df['label'].apply(lambda x: 1 if x == 'spam' else 0)
df.head()

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |



Let's set up our data for modeling:
- `X` will be the `message` column.
- `y` will be the `label` column.

**NOTE**: `CountVectorizer` expects a one-dimensional collection of strings, so make sure you set `X` to be a `pandas` Series, **not** a DataFrame.

In [71]:
X = df['message']
y = df['label']

**How many messages are `spam` and how many are `ham`?**

In [72]:
# code here please
df['label'].value_counts()

0    4825
1     747
Name: label, dtype: int64

**Split the data into training and testing sets (you can use the default parameters).** *Recall the [`stratify` parameter](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split): why would we use it here?*

In [9]:
# code here please

In [73]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2, random_state=42)
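The split above does not pass `stratify`, so the spam share in each split can drift from the overall share. The following sketch, on made-up imbalanced labels (`X_toy` and `y_toy` are assumptions for illustration), shows what `stratify` is meant to guarantee:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 ham (0) and 20 spam (1).
y_toy = pd.Series([0] * 80 + [1] * 20)
X_toy = pd.Series([f"message {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratify=y_toy, both splits keep the 80/20 class ratio.
print(y_tr.mean(), y_te.mean())
```

With an imbalanced target like spam/ham, this keeps the train and test scores comparable, because both sets contain the same proportion of the rare class.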

## Scikit-Learn `CountVectorizer`
---

`sklearn` offers a `CountVectorizer` class with many configurable options. Press `<shift+tab>` in the notebook to see its documentation.<br>
Create an instance with the default settings first, then explore the various hyperparameters of the class.

<details><summary>Remind me: what is a hyperparameter?</summary>

- A hyperparameter is a setting that affects our model but that the model cannot learn from the data, so we have to choose it ourselves!
- Examples of hyperparameters include:
    - the value of $k$ and the distance metric in $k$-nearest neighbors,
    - our regularization constants $\alpha$ or $C$ in linear and logistic regression.
</details>

**Do the following:**
* Instantiate a `CountVectorizer`.
* Fit the vectorizer on our corpus.
* Transform the corpus.

In [11]:
# code here please

In [74]:
# Instantiate a CountVectorizer.
count_vect = CountVectorizer()

In [75]:
# Fit the vectorizer on our corpus -- X_train.
count_vect.fit(X_train)

CountVectorizer()

In [76]:
# Transform the corpus -- X_train.
X_counts_transformed = count_vect.transform(X_train)
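To make the difference between `fit` and `transform` concrete, here is a small sketch on a made-up corpus (the documents and variable names are assumptions for illustration). The vocabulary is learned from the training text only, and words that first appear at transform time are dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "call me now"]
test_docs = ["free call tomorrow"]   # "tomorrow" never appears in the training text

vec = CountVectorizer()
vec.fit(train_docs)                  # learns the vocabulary from the training text only

print(sorted(vec.vocabulary_))       # the learned tokens, in column order

X_new = vec.transform(test_docs)     # reuses the learned vocabulary
print(X_new.toarray())               # one column per learned token; "tomorrow" has no column
```

This is why the vectorizer is fit on `X_train` and only applied to `X_test`: both matrices then share exactly the same columns.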

**Create a dataframe of your transformed data and see what it looks like.**

In [77]:
# code here please
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
X_train_df = pd.DataFrame(X_counts_transformed.toarray(), columns=count_vect.get_feature_names_out())
X_train_df.head()

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,02,0207,...,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Convert X_train into a DataFrame.


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zogtorius,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# for X_train


In [78]:
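# Transform the test set with the vectorizer that was fit on X_train.
X_test_count = count_vect.transform(X_test)
# Build the DataFrame from the test counts, not the training counts.
X_test_df = pd.DataFrame(X_test_count.toarray(), columns=count_vect.get_feature_names_out())
X_test_df.head(2)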

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,02,0207,...,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [79]:
# X_test
X_test_df.head(2)

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,02,0207,...,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [80]:
#X_test
X_test

3245    Squeeeeeze!! This is christmas hug.. If u lik ...
944     And also I've sorta blown him off a couple tim...
1044    Mmm thats better now i got a roast down me! i...
2484        Mm have some kanji dont eat anything heavy ok
812     So there's a ring that comes with the guys cos...
                              ...                        
4264    Den only weekdays got special price... Haiz......
2439        I not busy juz dun wan 2 go so early.. Hee.. 
5556    Yes i have. So that's why u texted. Pshew...mi...
4205    How are you enjoying this semester? Take care ...
4293                                                G.W.R
Name: message, Length: 1115, dtype: object

## Stop Words 
Some words are so common that they may not provide useful information about the $Y$ variable we're trying to predict. This may or may not be the case, so it is worth testing the model both with and without them!

In [81]:
# This list comes from NLTK (run nltk.download('stopwords') once if the corpus is missing).
from nltk.corpus import stopwords
print(len(stopwords.words('english')))
print(stopwords.words('english'))

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

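`CountVectorizer` gives you the option to eliminate stop words from your corpus. Passing `stop_words='english'` uses scikit-learn's own built-in English list, which is not the same as the NLTK list printed above.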

```python
cvec = CountVectorizer(stop_words='english')
```

You can optionally pass your own list of stop words that you'd like to remove.
```python
cvec = CountVectorizer(stop_words=['list', 'of', 'words', 'to', 'stop'])
```
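The effect of both options can be checked on a single made-up sentence. This is only a sketch; the sentence and the custom list are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is the free prize of the year"]

plain = CountVectorizer().fit(docs)                                # keeps every token
no_stop = CountVectorizer(stop_words="english").fit(docs)          # built-in English list
custom = CountVectorizer(stop_words=["free", "prize"]).fit(docs)   # our own list

print(sorted(plain.vocabulary_))
print(sorted(no_stop.vocabulary_))
print(sorted(custom.vocabulary_))
```

Note that a custom list removes only the words you name, so common words such as "the" stay in the vocabulary unless you list them too.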

## Bag of Words/Word Counting

---

Unstructured text carries a lot of information, and we need a way to turn it into numbers.
`CountVectorizer` does this by creating one column (feature) per token and counting the number of occurrences of each token in each document.

In [82]:
X_train_df.head()

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,02,0207,...,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Notice that the order of the words in the original document no longer matters!
- The **bag-of-words model** is a simplified representation of the raw data. 
- In this model, a document is represented as the bag/multiset of its words.

**Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.**

<details><summary>What might be some of the advantages of the bag-of-words approach?</summary>

- Efficient to store.
- Efficient to model.
- Keeps a decent amount of information.
</details>

<details><summary>What might be some of the disadvantages of the bag-of-words approach?</summary>

- Since bag-of-words models discard grammar, order, structure, and context, we lose a decent amount of information.
- Phrases like "not bad" or "not good" won't be interpreted properly.
</details>
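The loss of word order is easy to see on two made-up sentences that mean opposite things but contain the same words. A minimal sketch (the sentences are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog bit the man", "the man bit the dog"]

vec = CountVectorizer()
X_bow = vec.fit_transform(docs).toarray()

print(sorted(vec.vocabulary_))   # tokens in column order
print(X_bow)                     # both rows contain identical counts
```

Because the two rows are identical, no model trained on these features can tell the sentences apart.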

## N-Gram Range
---

`CountVectorizer` has the ability to capture **$n$-word phrases, also called $n$-grams**. Consider the following:

> The quick brown fox jumped over the lazy dog.

In the example sentence, the 2-grams (aka bi-grams) are:
- 'the quick'
- 'quick brown'
- 'brown fox'
- 'fox jumped'
- 'jumped over'
- 'over the'
- 'the lazy'
- 'lazy dog'

The `ngram_range` determines what $n$-grams should be considered as features.

```python
cvec = CountVectorizer(ngram_range=(1,2)) # Captures every single word and every 2-gram
```

**Try the code below**

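```python
from sklearn.feature_extraction.text import CountVectorizer

phrase = ["The quick brown fox jumped over the lazy dog. not good,  not bad"]
# Compare unigrams only, unigrams plus bigrams, and unigrams through tri-grams.
for tup in [(1, 1), (1, 2), (1, 3)]:
    vect = CountVectorizer(analyzer="word", ngram_range=tup)
    features = vect.fit_transform(phrase)
    print(features.shape)
    # get_feature_names_out() replaces get_feature_names() (removed in scikit-learn 1.2).
    print(vect.get_feature_names_out().tolist(), '\n\n')
```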

<details><summary>How many 3-grams (tri-grams) would be generated from the phrase "The Cat In The Hat?"</summary>

- Three 3-grams.
    - "The Cat In"
    - "Cat In The"
    - "In The Hat"
</details>

**Try the above code on "The Cat In The Hat?". How many tri-grams do you get?**

In [83]:
#Code here please
phrase = ["The Cat In The Hat?"]
# (1, 3) keeps unigrams, bigrams, and tri-grams; count the three-word entries in the output.
for tup in [(1,3)]:
    vect = CountVectorizer(analyzer = 'word',ngram_range=tup)
    features = vect.fit_transform(phrase)
    print(features.shape)
    print(vect.get_feature_names_out().tolist(),'\n\n')


(1, 11)
['cat', 'cat in', 'cat in the', 'hat', 'in', 'in the', 'in the hat', 'the', 'the cat', 'the cat in', 'the hat'] 




<details><summary>Why might we want to change <code>ngram_range</code> to something other than (1,1)?</summary>

- We can work with multi-word phrases like "not good" or "very hot."
</details>
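To count the tri-grams alone, one option is to set both ends of `ngram_range` to 3 so that unigrams and bigrams are left out. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(3, 3) keeps three-word phrases only.
vect = CountVectorizer(ngram_range=(3, 3))
vect.fit(["The Cat In The Hat?"])

trigrams = sorted(vect.vocabulary_)
print(len(trigrams), trigrams)
```

The tokens are lowercased and the question mark is dropped by the default tokenizer, so the result is the three tri-grams listed in the answer above.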

## Modeling

---

We may want to test lots of different values of hyperparameters in our CountVectorizer.

<details><summary>What two things will we use to do this?</summary>
    
- GridSearch
- Pipelines
</details>

In [159]:
# Build stop-word and n-gram features for the training and testing sets.
# Each vectorizer is fit on X_train only and then reused to transform X_test,
# so both sets end up with the same columns.
cvec = CountVectorizer(stop_words='english')
cvec_ngram = CountVectorizer(ngram_range=(1,2))

X_train_SW = cvec.fit_transform(X_train)
X_test_SW = cvec.transform(X_test)

X_train_NG = cvec_ngram.fit_transform(X_train)
X_test_NG = cvec_ngram.transform(X_test)

X_train_SW_df = pd.DataFrame(X_train_SW.toarray(), columns = cvec.get_feature_names_out())
X_test_SW_df = pd.DataFrame(X_test_SW.toarray(), columns = cvec.get_feature_names_out())
display(X_train_SW_df.shape)
display(X_test_SW_df.shape)

X_train_NG_df = pd.DataFrame(X_train_NG.toarray(), columns = cvec_ngram.get_feature_names_out())
X_test_NG_df = pd.DataFrame(X_test_NG.toarray(), columns = cvec_ngram.get_feature_names_out())
display(X_train_NG_df.shape)
display(X_test_NG_df.shape)

(4457, 7441)

(1115, 7441)

(4457, 42924)

(1115, 42924)

## Baseline accuracy
---

We need to calculate baseline accuracy in order to tell if our model is outperforming the null model (predicting the majority class).

In [153]:
# Code here please

log_reg_SW = LogisticRegression()
log_reg_SW.fit(X_train_SW_df, y_train)

log_reg_NG = LogisticRegression()
log_reg_NG.fit(X_train_NG_df, y_train)



LogisticRegression()

In [155]:
from sklearn.metrics import precision_recall_fscore_support as score
# The test frames share the training columns, so the fitted models can score them directly.
sw_pred = log_reg_SW.predict(X_test_SW_df)
ng_pred = log_reg_NG.predict(X_test_NG_df)

In [156]:
#score(y_test,sw_pred)
#score(y_test,ng_pred)

0    0.865688
1    0.134312
Name: label, dtype: float64
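The baseline is simply the share of the majority class. Here is a minimal sketch that uses the full-data class counts reported earlier in this lab (4,825 ham and 747 spam) rather than any particular train/test split; on a real split you would typically look at `y_test.value_counts(normalize=True)` instead:

```python
# Class counts reported earlier in this lab for the full data set.
n_ham, n_spam = 4825, 747

# The null model always predicts the majority class,
# so its accuracy equals that class's share of the data.
baseline_accuracy = max(n_ham, n_spam) / (n_ham + n_spam)
print(round(baseline_accuracy, 4))
```

Any model worth keeping has to beat this number, which is already high because the classes are imbalanced.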

In [26]:
# Let's set it up with two stages:
# 1. An instance of CountVectorizer (transformer)
# 2. A LogisticRegression instance (estimator)

#pipe = Pipeline([.....])
# CODE HERE PLEASE
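One possible way to set up the two stages is sketched below. The step names `cvec` and `lr` match the ones that appear in the `GridSearchCV` output later in this lab; the four toy messages are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stage 1 turns raw text into token counts; stage 2 classifies those counts.
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr", LogisticRegression()),
])

# Made-up toy data: 1 = spam, 0 = ham.
toy_X = ["win a free prize now", "free cash prize", "see you at lunch", "call me at home"]
toy_y = [1, 1, 0, 0]

# The pipeline accepts raw strings: fit() vectorizes and then trains in one call.
pipe.fit(toy_X, toy_y)
print(pipe.predict(["free prize"]))
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold during cross-validation, which keeps test text from leaking into the vocabulary.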

## `GridSearchCV`
---

At this point, you could use your `pipeline` object as a model:

```python
from sklearn.model_selection import cross_val_score

# Estimate how your model will perform on unseen data
cross_val_score(pipe, X_train, y_train, cv=3).mean() 

# Fit your model
pipe.fit(X_train, y_train)

# Training score
pipe.score(X_train, y_train)

# Test score
pipe.score(X_test, y_test)
```

Since we want to tune over the `CountVectorizer`, we'll load our `pipeline` object into `GridSearchCV`.

In [28]:
# Search over the following values of hyperparameters:
# Maximum number of features fit (max_features): 2500, 3000, 3500
# Minimum number of documents a token must appear in (min_df): 2, 3
# Maximum share of documents a token may appear in (max_df): 90%, 95%
# N-gram range (ngram_range): individual tokens only, and individual tokens plus bigrams.

#pipe_params = {......}

In [30]:
# Instantiate GridSearchCV.

#gs = GridSearchCV(pipe......please use 3-fold cross-validation.

In [33]:
# Fit GridSearch to training data.


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('lr', LogisticRegression())]),
             param_grid={'cvec__max_df': [0.9, 0.95],
                         'cvec__max_features': [2500, 3000, 3500],
                         'cvec__min_df': [2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2)]})
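The output above corresponds to a setup along the following lines. This is a sketch, not the lab's exact solution: the step names and grid values are read off the output, while the toy corpus is made up and repeated so that the `min_df` and `max_df` cut-offs leave a usable vocabulary in every fold:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus (0 = ham, 1 = spam); each message appears three times.
ham = ["see you at lunch today", "are you coming home today",
       "call me when you get home", "lunch at home today"] * 3
spam = ["win a free prize now", "free entry to win cash",
        "claim your free cash prize now", "win cash now"] * 3
toy_X = ham + spam
toy_y = [0] * len(ham) + [1] * len(spam)

pipe = Pipeline([("cvec", CountVectorizer()), ("lr", LogisticRegression())])

# Keys follow the "<step name>__<parameter name>" convention.
pipe_params = {
    "cvec__max_features": [2500, 3000, 3500],
    "cvec__min_df": [2, 3],
    "cvec__max_df": [0.9, 0.95],
    "cvec__ngram_range": [(1, 1), (1, 2)],
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(toy_X, toy_y)

print(gs.best_score_)          # mean cross-validated accuracy of the best combination
print(gs.best_params_)
gs_model = gs.best_estimator_  # pipeline refit on all of the data passed to fit()
```

On the real data you would pass `X_train` and `y_train` to `gs.fit` and then score `gs_model` on both the training and testing sets.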

In [35]:
# What's the best score?


0.980176009504255


In [37]:
# Save best model as gs_model.
gs_model = gs.best_estimator_

In [39]:
# Score model on training set.


0.997053308331101

In [41]:
# Score model on testing set.


0.9815116911364872

## Term Frequency-Inverse Document Frequency (TF-IDF)

---

<details><summary>Which type of word do you think may be most useful in modeling?</summary>

- Words that are common in certain documents but rare in other documents tend to be more informative than words that are common in all documents or rare in all documents.
</details>

A TF-IDF score tells us which words are most differentiating between documents. Words that occur often in one document but don't occur in many documents contain a great deal of predictive power.

- The TF-IDF score is a statistical measure used to evaluate how important a word is to a document relative to all other documents.

Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

<details><summary>(BONUS) Let's see how it's calculated.</summary>
    
Term frequency (`tf`) is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

where

- $N_\text{term}$ is the number of times a term/word $t$ appears in document $d$
- $N_\text{terms in Document}$ is the number of terms/words in document $d$

Inverse document frequency (`idf`) is the logarithm of the inverse of the fraction of documents in the corpus that contain the term:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$

where

- $N_\text{Documents}$ is the number of documents in the corpus $D$
- $N_\text{Documents that contain term}$ is the number of documents in $D$ that contain term/word $t$

TF-IDF is then calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$

</details>
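The three formulas in the bonus section can be checked by hand on a tiny example. The sketch below implements them literally on a made-up, pre-tokenized corpus (the corpus and helper names are assumptions for illustration). Note that scikit-learn's `TfidfVectorizer` uses a smoothed `idf` and L2-normalizes each row by default, so its numbers will differ from this textbook version:

```python
import math

# Made-up toy corpus: each document is a list of tokens.
corpus = [
    ["free", "prize", "now"],
    ["call", "me", "now"],
    ["free", "free", "cash"],
]

def tf(term, doc):
    """Share of the tokens in `doc` that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`)."""
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)  # assumes the term occurs in at least one document

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "free" fills 2/3 of the last document but appears in 2 of 3 documents.
print(round(tfidf("free", corpus[2], corpus), 4))
# "cash" fills only 1/3 of the last document but appears nowhere else.
print(round(tfidf("cash", corpus[2], corpus), 4))
```

Even though "free" occurs twice as often as "cash" in the last document, "cash" gets the higher score because it is rarer across the corpus.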

<a id='tfidf-vec'></a>
## Practice Using the `TfidfVectorizer`

---

### Why Use TF-IDF?
- Common words are penalized.
- Rare words have more influence.

Scikit-learn provides a TF-IDF vectorizer that works similarly to the other vectorizers we've covered. Notice that we can also eliminate stop words to improve our analysis.

As you did above, import and initialize the `TfidfVectorizer`, then fit the spam and ham data.
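Before working on the real data, it can help to see the vectorizer on a made-up three-document corpus. This is a sketch under default settings (the documents are assumptions for illustration; the name `tvec` mirrors the one used later in this lab):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["free prize now", "call me now", "free free cash"]

tvec = TfidfVectorizer()
X_tfidf = tvec.fit_transform(docs)   # sparse matrix of TF-IDF weights

print(sorted(tvec.vocabulary_))      # tokens in column order
print(X_tfidf.toarray().round(2))

# With the default norm="l2", every row is scaled to unit length.
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
print(row_norms)
```

In the first document, "prize" receives a larger weight than "now" because "now" also appears in the second document.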

In [42]:
# create instance for the tfidf transformer - use X_train
# fit on the dataframe and see the output - X_train

In [43]:
# Fit the transformer.


In [44]:
#create a dataframe with each feature as column


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zogtorius,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# do the following
# Instantiate logistic regression or any other model you think should be a good selection.
# Fit logistic regression.
# Evaluate logistic regression.

In [46]:
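# Fit the TF-IDF vectorizer on the training text only, then reuse it on the test text.
tvec = TfidfVectorizer()
X_train = tvec.fit_transform(X_train)

X_test = tvec.transform(X_test)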

In [47]:
# Instantiate logistic regression.


# Fit logistic regression.


# Evaluate logistic regression.


Training Score: 0.9726761317974819
Testing Score: 0.9711799891245242


## To Do:<br>
Think about adding a new feature, the length of the document, as a new column. You can then treat the text data in the same way as above. Train your model and see how it performs.
Don't you think the length of the message should be informative?
Well, I tried this, and it is actually informative!
<img src="Len_msg.png" style="float: center; height: 200px">
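One way to combine the text counts with a length column is scikit-learn's `ColumnTransformer`. The sketch below is only one possible design, shown on a made-up four-row frame (the `toy` data, the `length` column name, and the step names are assumptions for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data: 1 = spam, 0 = ham.
toy = pd.DataFrame({
    "message": [
        "win a free prize now call this number to claim",
        "free entry to the weekly cash draw text win to enter",
        "ok see you",
        "call me later",
    ],
    "label": [1, 1, 0, 0],
})
toy["length"] = toy["message"].str.len()   # number of characters per message

features = ColumnTransformer([
    # A plain string selects a 1-D column, which is what CountVectorizer expects.
    ("text", CountVectorizer(), "message"),
    # A list selects a 2-D block that is passed through unchanged.
    ("length", "passthrough", ["length"]),
])

model = Pipeline([
    ("features", features),
    ("lr", LogisticRegression(max_iter=1000)),  # extra iterations because length is unscaled
])
model.fit(toy[["message", "length"]], toy["label"])
print(model.predict(toy[["message", "length"]]))
```

The resulting feature matrix has one column per token plus one extra column for the message length; scaling the length column is a reasonable next step to try.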

## Interview Question

## (BONUS) How is the information from vectorizers stored efficiently?

When you CountVectorize the text messages, you get 8,713 features and have 5,572 rows... this is 48,548,836 entries. That's a lot of data to store in a dataframe!

<details><summary>How many of these values are zero?</summary>

- 48,474,667
- About 99.85% of all values are zero!
</details>

Instead of storing all those zeroes, `sklearn` automatically stores these as a sparse matrix. It saves **a lot** of space.
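The share of zeros can be read directly from a sparse matrix through its `nnz` attribute (the number of stored non-zero entries). A minimal sketch on a made-up corpus (the documents are assumptions for illustration):

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize now", "call me now", "see you at lunch", "win cash today"]
X_sparse = CountVectorizer().fit_transform(docs)

n_rows, n_cols = X_sparse.shape
n_total = n_rows * n_cols
n_nonzero = X_sparse.nnz          # only these entries are actually stored

print(sparse.issparse(X_sparse))
print(f"{n_nonzero} of {n_total} entries are non-zero")
print(f"share of zeros: {1 - n_nonzero / n_total:.2%}")
```

The same calculation on the figures quoted above gives 48,548,836 - 48,474,667 = 74,169 stored values, which is where the 99.85% comes from.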

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

cvec = CountVectorizer()

X_train = cvec.fit_transform(X_train)

print(type(X_train))
print(X_train[0])

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 6885)	1
  (0, 3407)	1
  (0, 6754)	1
  (0, 3977)	1
  (0, 4462)	1
  (0, 4368)	1
  (0, 853)	1
  (0, 2888)	1
  (0, 3155)	1


## Additional Resources:

### Over 100 lectures (more than 25 hours of recordings) by _Dr. Junaid Qazi_
*  [Data Science and Machine Learning using Python -- A bootcamp](https://www.youtube.com/c/DrJunaidQazi) 

The above course starts with an introduction to Python and covers all the needed topics using key libraries in the data science ecosystem, including `numpy, pandas, matplotlib, seaborn, plotly, scikit-learn, nltk`, and more. An additional section discusses NLP and basic recommender systems.

The links below are specifically on NLP:

* [NLP -- Theory Lecture](https://www.youtube.com/watch?v=CKqLXAatHyw)
* [NLP -- Part-1: Hands-on (a complete spam-ham project)](https://www.youtube.com/watch?v=S2YDsGDr6Z8)
* [NLP -- Part-2: Hands-on (a complete spam-ham project)](https://www.youtube.com/watch?v=Bko0Nba57TM)
* [NLP -- Part-3: Hands-on (a complete spam-ham project)](https://www.youtube.com/watch?v=AHIl0Uy8S80)
* [NLP -- Part-4: Hands-on (a complete spam-ham project)](https://www.youtube.com/watch?v=rApsFaH3oR4)
* [NLP -- Part-5: Hands-on (a complete spam-ham project)](https://www.youtube.com/watch?v=ETBKTi7RLCw)

## License

Author: [___Dr. Junaid Qazi___](https://www.linkedin.com/in/jqazi/)

Twitter: [***@JunaidSQazi***](https://twitter.com/JunaidSQazi)

Copyright 2021

Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0) (the "License").<br>You may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

*Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. Please see the License for the specific language governing permissions and limitations under the License.*


*This is not an official product but sample code provided for educational purposes.*

***Acknowledgement is requested***