# Natural Language Processing

## Basic NLP

```python
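# One-time setup: uncomment to download the stop word corpus
# import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
sw = stopwords.words('english')  # list of English stop words

# EDA: WordCloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
# Jupyter magic above; omit it in a plain script
# my_text: a single string holding the text to visualize
wc = WordCloud(background_color="white", colormap="Dark2", max_font_size=100,
               stopwords=set(sw), max_words=200)
wc.generate(my_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()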

# optionally, stem each token first (e.g. with nltk.stem.PorterStemmer)
# so that inflected forms of a word share one feature

# turn each document in the corpus into a bag-of-words vector
# X_train: iterable of raw text documents (strings)
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(stop_words=stopwords.words('english'))
bow.fit(X_train)
bow_transformed = bow.transform(X_train)

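# dict mapping each word to its column index
bow.vocabulary_

# simple visualization of the bag-of-words matrix as a DataFrame
import pandas as pd
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(bow_transformed.toarray(), columns=bow.get_feature_names_out()).head()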

# sparsity: fraction of entries in the sparse matrix that are zero
n_docs, n_terms = bow_transformed.shape
bow_transformed.nnz  # number of stored (non-zero) entries
sparsity = 1 - bow_transformed.nnz / (n_docs * n_terms)

# implementing tfidf
from sklearn.feature_extraction.text import TfidfTransformer
tfidf= TfidfTransformer().fit(bow_transformed)
tfidf_transformed= tfidf.transform(bow_transformed)

# ready to apply a model
from sklearn.naive_bayes import MultinomialNB
mnb= MultinomialNB()
mnb.fit(tfidf_transformed,y_train)
y_hat= mnb.predict(tfidf.transform(bow.transform(X_test)))

```
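
The stemming step mentioned in the comments above is not shown. A minimal sketch, assuming NLTK's `PorterStemmer` and a plain whitespace tokenizer (both are illustrative choices, and `stem_text` is a hypothetical helper name), might look like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Lowercase the text, split on whitespace, and stem each token."""
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

# Inflected forms collapse to a shared stem, which shrinks the vocabulary.
print(stemmer.stem("running"))
print(stem_text("cats running"))
```

One way to plug this in is to pass the helper as the `preprocessor` argument of `CountVectorizer`, so each document is stemmed before it is tokenized and counted.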

## Create a pipeline
```python
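from sklearn.pipeline import Pipeline
# chain vectorizing, tf-idf weighting, and the classifier into one estimator
pipeline = Pipeline([('bow', CountVectorizer(stop_words=stopwords.words('english'))),
                     ('tfidf', TfidfTransformer()),
                     ('mnb', MultinomialNB())])
pipeline.fit(X_train, y_train)    # fits every step on the raw training text
y_hat = pipeline.predict(X_test)  # applies the fitted transforms, then predicts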
```
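
As a rough end-to-end illustration, the sketch below runs the same three steps on a tiny made-up dataset. The texts and labels are hypothetical, and scikit-learn's built-in `"english"` stop word list is swapped in for NLTK's so the example needs no downloads.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, invented purely for illustration.
X_train = [
    "win a free prize now",
    "free cash prize click now",
    "are we meeting for lunch today",
    "see you at the meeting tomorrow",
]
y_train = ["spam", "spam", "ham", "ham"]
X_test = ["free prize", "lunch meeting tomorrow"]

pipeline = Pipeline([
    ("bow", CountVectorizer(stop_words="english")),  # built-in list, not NLTK's
    ("tfidf", TfidfTransformer()),
    ("mnb", MultinomialNB()),
])

# fit() trains every step on raw text; predict() reuses the fitted transforms.
pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)
print(list(y_hat))
```

Because the vectorizer lives inside the pipeline, the test text never has to be transformed by hand, which avoids accidentally fitting the vocabulary on test data.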

## Sentiment Analysis
**TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary depending on where it appears in a sentence. The TextBlob module lets us take advantage of these labels.

**Sentiment Labels:** Each word in a corpus is labeled with a polarity and a subjectivity score (there are other labels too, but we ignore them for now). The sentiment of a corpus is the average of these word-level scores.
   * **Polarity**: How positive or negative a word is. -1 is very negative, +1 is very positive.
   * **Subjectivity**: How subjective, or opinionated, a word is. 0 is fact, +1 is very much an opinion.



```python
from textblob import TextBlob

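# helpers that return the polarity and subjectivity of a text string
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

# df: a pandas DataFrame with the raw text in column 'msg'
df['polarity'] = df['msg'].apply(pol)
df['subjectivity'] = df['msg'].apply(sub)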
```
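
To make the averaging idea concrete, here is a simplified sketch of lexicon-based scoring. The tiny `LEXICON`, its scores, and the `average_sentiment` helper are all invented for illustration; they are not TextBlob's actual values or implementation, which is more involved than a plain average.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity). Values are made up.
LEXICON = {
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
    "blue": (0.0, 0.1),
}

def average_sentiment(text):
    """Average polarity and subjectivity over the words found in LEXICON."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not scores:
        return 0.0, 0.0  # no labeled words: treat the text as neutral
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

# "great" and "terrible" are the only labeled words, so they are averaged.
pol, sub = average_sentiment("The movie was great but the ending was terrible")
print(round(pol, 3), round(sub, 3))
```

Words missing from the lexicon are skipped here, so a long text with only a few labeled words can still receive a strong score.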