# Natural Language Processing

### using Python

<img src='NLP01.jpg'>

## What are we going to touch upon

### 1. CountVectorizer
### 2. N-grams
### 3. Stop words
### 4. DF (document frequency)
### 5. Stemming
### 6. Lemmatization
### 7. TF-IDF
### 8. Sentiment analysis
### 9. Logistic regression

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline



ModuleNotFoundError: No module named 'textblob'

<img src='getready.jpg'>

### Use Yelp dataset
<img src='yelp.jpg'>

In [2]:
# read yelp.csv into a DataFrame
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv'
yelp = pd.read_csv(url)

# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
print(yelp_best_worst.head())
print("------print X[0]--------")
# define X and y
X = yelp_best_worst.text
print(X[0])
print("------print y[0]--------")
y = yelp_best_worst.stars
print(y[0])
print("------print length of X_train--------")
# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(X_train))

              business_id        date               review_id  stars  \
0  9yKzy9PApeiPPOUJEtnvkg  2011-01-26  fWKvX83p0-ka4JS3dc6E5A      5   
1  ZRJwVLyzEJq1VAihDhYiow  2011-07-27  IjZ33sJrzXqU-0X6U8NwyA      5   
3  _1QQZuf4zZOyFCvXc0o6Vg  2010-05-27  G-WvGaISbqqaMHlNnByodA      5   
4  6ozycU1RpktNG2-1BroVtw  2012-01-05  1uJFq2r5QfJG_6ExMRCaGw      5   
6  zp713qNhx8d9KCJJnrw1xA  2010-02-12  riFQ3vxNpP4rWLk_CSri2A      5   

                                                text    type  \
0  My wife took me here on my birthday for breakf...  review   
1  I have no idea why some people give bad review...  review   
3  Rosie, Dakota, and I LOVE Chaparral Dog Park!!...  review   
4  General Manager Scott Petello is a good egg!!!...  review   
6  Drop what you're doing and drive here. After I...  review   

                  user_id  cool  useful  funny  
0  rLtl8ZkDX5vH5nAx9C3q5Q     2       5      0  
1  0a2KyEL0d3Yb1V6aivbIuQ     0       0      0  
3  uZetl9T0NcROGOyFfughhg     1    

<img src='countvectorizer.jpg'>

<img src='BagOfWords.jpg'>
Bag of Words (BoW) is a representation that records how many times each word appears in a document, ignoring word order. Those word counts allow us to compare documents and gauge their similarity for applications like search, document classification and topic modeling.
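
As a minimal sketch of the counting idea, here is a plain-Python version (not scikit-learn's implementation). The `bag_of_words` helper and the two sample sentences are made up for illustration; the regular expression mirrors the default `token_pattern` that `CountVectorizer` reports.

```python
import re
from collections import Counter

def bag_of_words(text):
    # lowercase, then keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

doc_a = bag_of_words("The food was great and the service was great")
doc_b = bag_of_words("Great food, slow service")

print(doc_a["great"])                    # how often "great" occurs in the first document
print(sorted(set(doc_a) & set(doc_b)))   # words the two documents share
```

Comparing the shared words (or the full count vectors) is the basis for the similarity measures mentioned above.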

In [3]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)

X_test_dtm = vect.transform(X_test)
# note: X_train[0] selects by index label, while X_train_dtm[0] is the first row by position
print(X_train[0], X_train_dtm[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!   (0, 13745)	1
  (0, 4837)	1
  (0, 16394)	1
  (0, 7149)	1
  (0, 3999)	1
  (0, 2069)	1
  (0, 14721)	1
  (0, 126

In [4]:
# rows are documents, columns are terms (aka "tokens" or "features")
X_train_dtm.shape

(3064, 16825)

In [5]:
# last 50 features
print(vect.get_feature_names()[-50:])

['yyyyy', 'z11', 'za', 'zabba', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zihuatenejo', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombi', 'zombies', 'zone', 'zones', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel', 'zzed', 'éclairs', 'école', 'ém']


In [6]:
# show vectorizer options
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [7]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
print(X_train[0], X_train_dtm[0])
X_train_dtm.shape

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!   (0, 18233)	1
  (0, 10889)	1
  (0, 20503)	1
  (0, 12846)	1
  (0, 10175)	1
  (0, 8532)	1
  (0, 19097)	1
  (0, 

(3064, 20838)

An n-gram is a contiguous sequence of n words. For example, in the sentence "dog that barks does not bite", the n-grams are:
1. unigrams (n=1): dog, that, barks, does, not, bite
2. bigrams (n=2): dog that, that barks, barks does, does not, not bite
3. trigrams (n=3): dog that barks, that barks does, barks does not, does not bite
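
The list above can be reproduced with a short sliding-window sketch. The `ngrams` helper is a hypothetical name used only here, and it splits on whitespace rather than using `CountVectorizer`'s tokenizer.

```python
def ngrams(text, n):
    # split on whitespace and slide a window of n words across the sentence
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "dog that barks does not bite"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```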


<img src='dinosaur_ngrams.png'>

In [8]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 169847)

In [9]:
# last 100 features
print(vect.get_feature_names()[-100:])

['zia', 'zia and', 'zihuatenejo', 'zihuatenejo awaited', 'zilch', 'zilch to', 'zin', 'zin that', 'zinburger', 'zinburger all', 'zinburger are', 'zinburger last', 'zinburger truffle', 'zinburger very', 'zinburger was', 'zinburgergeist', 'zinburgergeist and', 'zinc', 'zinc lozenge', 'zinc supplementation', 'zinfandel', 'zinfandel for', 'zinfandel have', 'zing', 'zing to', 'zip', 'zip code', 'zip up', 'zipcar', 'zipcar at', 'zipper', 'zipper at', 'zippers', 'zippers flipstick', 'zipps', 'zipps decided', 'zipps or', 'ziti', 'ziti and', 'ziti stayed', 'zoe', 'zoe and', 'zoe kitchen', 'zombi', 'zombi has', 'zombi up', 'zombies', 'zombies the', 'zone', 'zone of', 'zone out', 'zone when', 'zones', 'zones dolls', 'zoning', 'zoning issues', 'zoo', 'zoo and', 'zoo is', 'zoo not', 'zoo the', 'zoo ve', 'zoyo', 'zoyo for', 'zucca', 'zucca appetizer', 'zucchini', 'zucchini and', 'zucchini bread', 'zucchini broccoli', 'zucchini carrots', 'zucchini fries', 'zucchini pieces', 'zucchini strips', 'zucchin

In [10]:
# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices for both the training and test sets
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

### Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

Naive Bayes has been studied extensively since the 1950s. It was introduced under a different name into the text retrieval community in the early 1960s and remains a popular (baseline) method for text categorization, the problem of judging documents as belonging to one category or the other (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate pre-processing, it is competitive in this domain with more advanced methods including support vector machines.
<img src='nb.PNG'>
- Posterior probability - the statistical probability that a hypothesis is true calculated in the light of relevant observations.
- Prior probability - A prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account.
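
To make the prior and posterior concrete, here is a small from-scratch sketch of a multinomial Naive Bayes calculation. The four training snippets are invented, add-one (Laplace) smoothing is an assumption of this sketch, and the `fit` and `posterior` helpers are hypothetical names rather than part of scikit-learn.

```python
import math
from collections import Counter

# toy training data: (text, star rating) pairs invented for this example
train = [
    ("great food great service", 5),
    ("amazing place", 5),
    ("terrible food", 1),
    ("rude service terrible wait", 1),
]

def fit(train):
    class_docs = Counter()   # number of documents per class
    word_counts = {}         # per-class word counts
    vocab = set()
    for text, label in train:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def posterior(text, class_docs, word_counts, vocab):
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label in class_docs:
        # log prior: share of training documents in this class
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # log likelihood with add-one smoothing
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # normalize so the posteriors sum to 1
    norm = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / norm for label, s in log_scores.items()}

model = fit(train)
print(posterior("great service", *model))  # the 5-star class gets the higher posterior
```

The "naive" part is the loop over words: each word contributes its likelihood independently of the others.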

<img src='NBexample.JPG'>

In [11]:
# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))
#print(y_test)

0.9187866927592955


### Let us train and test the model with different vectorizers
<img src='family-guy-gif-training.gif'>

In [12]:
# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [13]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  169847
Accuracy:  0.8542074363992173


In [14]:
# show vectorizer options
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Natural language processing (NLP) is a research field that presents many challenges, such as natural language understanding.
### Stop words
Text may contain stop words like 'the', 'is', 'are'. Stop words can be filtered from the text before it is processed. There is no universal list of stop words in NLP research; the nltk module contains one, and `CountVectorizer(stop_words='english')` uses scikit-learn's own built-in list.
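
The filtering step itself is simple, as this plain-Python sketch suggests. The `STOP_WORDS` set below is a tiny hand-picked sample for illustration only; the lists shipped with nltk and scikit-learn are much longer.

```python
# a tiny, hand-picked stop-word set (an assumption of this sketch)
STOP_WORDS = {"the", "is", "are", "a", "and", "it", "was"}

def remove_stop_words(text):
    # lowercase, split on whitespace, and drop anything in the stop-word set
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The food is great and the staff are friendly"))
```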

<img src='bigwords.jpg'>

In [15]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  16528
Accuracy:  0.9158512720156555


In [16]:
# set of stop words
print(vect.get_stop_words())

frozenset({'ie', 'toward', 'due', 'under', 'when', 'become', 'other', 'since', 'do', 'sixty', 'yet', 'hereby', 'its', 'eg', 'amount', 'has', 'ten', 'system', 'i', 'whereafter', 'anyhow', 'being', 'hence', 'move', 'please', 'thence', 'whatever', 'than', 'whereupon', 'whether', 'towards', 'will', 'may', 'thin', 'very', 'after', 'twenty', 'you', 'whereas', 'no', 'those', 'why', 'via', 'found', 'therein', 'across', 'him', 'can', 'is', 'etc', 'could', 'latterly', 'meanwhile', 'so', 'such', 'throughout', 'somehow', 'he', 'moreover', 'up', 'whose', 'ltd', 'also', 'anyway', 'former', 'only', 'eleven', 'about', 'made', 'top', 'until', 'or', 'myself', 'her', 'mill', 'nine', 'sometime', 'became', 'there', 'both', 'by', 'had', 'nor', 'these', 'our', 'them', 'detail', 'again', 'formerly', 'although', 'seeming', 'show', 'and', 'it', 'someone', 'third', 'fifty', 'thru', 'below', 'call', 'already', 'beside', 'front', 'but', 'upon', 'ourselves', 'whoever', 'some', 'for', 'through', 'becomes', 'whereby'

In [17]:
# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

Features:  100
Accuracy:  0.8698630136986302


In [18]:
# all 100 features
print(vect.get_feature_names())

['amazing', 'area', 'atmosphere', 'awesome', 'bad', 'bar', 'best', 'better', 'big', 'came', 'cheese', 'chicken', 'clean', 'coffee', 'come', 'day', 'definitely', 'delicious', 'did', 'didn', 'dinner', 'don', 'eat', 'excellent', 'experience', 'favorite', 'feel', 'food', 'free', 'fresh', 'friendly', 'friends', 'going', 'good', 'got', 'great', 'happy', 'home', 'hot', 'hour', 'just', 'know', 'like', 'little', 'll', 'location', 'long', 'looking', 'lot', 'love', 'lunch', 'make', 'meal', 'menu', 'minutes', 'need', 'new', 'nice', 'night', 'order', 'ordered', 'people', 'perfect', 'phoenix', 'pizza', 'place', 'pretty', 'prices', 'really', 'recommend', 'restaurant', 'right', 'said', 'salad', 'sandwich', 'sauce', 'say', 'service', 'staff', 'store', 'sure', 'table', 'thing', 'things', 'think', 'time', 'times', 'took', 'town', 'tried', 'try', 've', 'wait', 'want', 'way', 'went', 'wine', 'work', 'worth', 'years']


### Let us test with 100,000 features

In [19]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)

Features:  100000
Accuracy:  0.8855185909980431


<img src='df.jpg'>

### Let us test with a minimum document frequency of 2

In [20]:
# include 1-grams and 2-grams, and only keep terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

Features:  43957
Accuracy:  0.9324853228962818


# Reset
<img src='reset.jpg'>

In [21]:
# print the first review
print(yelp_best_worst.text[0])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!


In [22]:
# save it as a TextBlob object
# TextBlob is a Python (2 and 3) library for processing textual data.
# It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging,
# noun phrase extraction, sentiment analysis, classification, translation, and more.
review = TextBlob(yelp_best_worst.text[0])  # use TextBlob for text processing

NameError: name 'TextBlob' is not defined

In [None]:
# list the words
review.words

In [None]:
# list the sentences
review.sentences

In [None]:
# some string methods are available
review.lower()

### Stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is $\Rightarrow$ be 

car, cars, car's, cars' $\Rightarrow$ car

The result of this mapping of text will be something like:

the boy's cars are different colors $\Rightarrow$ 
the boy car be differ color

However, the two words differ in their flavor. 

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. 

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
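
The contrast can be sketched with two toy functions. Neither is the Snowball or WordNet algorithm: the suffix list and the lemma dictionary below are hypothetical and chosen only to show how chopping differs from looking words up.

```python
# hypothetical suffix list and lemma dictionary, for illustration only
SUFFIXES = ("ing", "es", "s")
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "colors": "color"}

def crude_stem(word):
    # chop the first matching suffix, keeping at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def lookup_lemma(word):
    # dictionary lookup; unknown words are returned unchanged
    return LEMMAS.get(word, word)

for word in ["cars", "organizing", "organizes", "are"]:
    print(word, crude_stem(word), lookup_lemma(word))
```

The stemmer maps "organizing" and "organizes" to the same non-word stem but cannot relate "are" to "be", while the dictionary handles "are" and misses anything it does not list.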

<img src='stem.png'>

In [None]:
# initialize stemmer
stemmer = SnowballStemmer('english')  # SnowballStemmer is one of several stemming algorithms

# stem each word
print([stemmer.stem(word) for word in review.words])

<img class = 'one' src='same1.jpg'> <img class = 'one' src='lemma.jpg'> 
POS - Part of speech

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# adjective (pos="a"): a word that describes a noun, such as good, big, far
# (adverbs, which modify verbs, adjectives, or other adverbs, use pos="r" instead)
print(lemmatizer.lemmatize("better", pos="a"))  # comparative form of "good"

# noun (pos="n"): a word (other than a pronoun) used to identify any of a class of people, places, or things (common noun),
# or to name a particular one of these (proper noun)
print(lemmatizer.lemmatize("cats", pos="n"))  # plural noun

# verb (pos="v"): a word used to describe an action, state, or occurrence, and forming the main part of the predicate of a sentence,
# such as hear, become, happen
print(lemmatizer.lemmatize("had", pos="v"))  # past tense of "have"

 <img class = 'two' src='same2.jpg'>

In [None]:
# assume every word is a noun
print([word.lemmatize() for word in review.words])

In [None]:
# assume every word is a verb
print([word.lemmatize(pos='v') for word in review.words])

In [None]:
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    text = text.lower()
    words = TextBlob(text).words 
    return [word.lemmatize() for word in words]

In [None]:
# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)

In [None]:
# last 50 features (features are sorted alphabetically, so this is not a ranking by importance)
print(vect.get_feature_names()[-50:])

# TF-IDF
<img src='tfidf.png'>

In [None]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term Frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
tf

In [None]:
# Document Frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names())

In [None]:
# Term Frequency-Inverse Document Frequency (simple version)
tf/df

In [None]:
# TfidfVectorizer: the higher the value, the more significant the term is within that document
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
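
The `TfidfVectorizer` values differ from the simple `tf/df` ratio. Assuming the library's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), the calculation can be sketched in plain Python on the same three example documents:

```python
import math
import re

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# tokenize like the default vectorizer: lowercase, tokens of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# document frequency: number of documents containing each term
doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
# smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

rows = []
for tokens in tokenized:
    weights = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])  # L2-normalize each document

print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Because "call" appears in every document, its IDF is the minimum possible, so it ends up with a lower weight than rarer words such as "tonight".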

In [None]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape

In [None]:
def summarize():
    
    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = yelp.text[review_id]
        review_length = len(review_text)
    
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]
    
    # print words with the top 10 TF-IDF scores
    print('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:10]
    for word, score in top_scores:
        print(word)
    

In [None]:
summarize()

## Importance
TF-IDF scores can be used for document clustering. In this example, the words cocktails, oysters, martinis, and inexpensive scored very high, which hints at the values and aspirations of the reviewing crowd.

Example uses: Twitter trends, cricket advertisement analysis, and launching a new product based on an existing brand.

# Sentiment Analysis

<img src='senti.jpg'>

In [None]:
print(review)

In [None]:
# polarity ranges from -1 (most negative) to 1 (most positive)
review.sentiment.polarity
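
To give a feel for what a polarity score is, here is a toy lexicon-based sketch. It is not TextBlob's algorithm: the `LEXICON` scores are made up, and the averaging rule is only one simple way to land in the range from -1 to 1.

```python
# made-up word scores for illustration; a real lexicon is far larger
LEXICON = {"excellent": 1.0, "amazing": 0.8, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def toy_polarity(text):
    # average the scores of the words found in the lexicon; 0.0 if none are found
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("the food was excellent and the service was good"))
print(toy_polarity("terrible wait and bad coffee"))
print(toy_polarity("we went on a tuesday"))
```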

<img src='wifi.jpg'>

### Let us do this for the entire Yelp dataset

In [None]:
# understanding the apply method
yelp['length'] = yelp.text.apply(len)
yelp.head(1)

In [None]:
# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity

In [None]:
# create a new DataFrame column for sentiment (WARNING: SLOW!)
yelp['sentiment'] = yelp.text.apply(detect_sentiment)

## Box Plot
<img src='box_plot.png'>

In [None]:
# box plot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')

In [None]:
# reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()

In [None]:
# reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()

In [None]:
# widen the column display
pd.set_option('max_colwidth', 500)

### Rated 5 stars, but sentiment analysis shows negative

In [None]:
# negative sentiment in a 5-star review
yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head(1)

### Rated 1 star, but sentiment analysis shows positive

In [None]:
# positive sentiment in a 1-star review
yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head(1)

In [None]:
# reset the column display width
pd.reset_option('max_colwidth')

<img src='bazzinga.jpg'>

# Next: Logistic regression

Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression. It measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities with the logistic function, which is the cumulative distribution function of the logistic distribution.
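
A minimal sketch of the prediction step, assuming already-fitted coefficients: the weights and bias below are hypothetical numbers, not values learned from the Yelp data, and `predict_proba` here is a local helper rather than the scikit-learn method of the same name.

```python
import math

def logistic(z):
    # maps any real number to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    # linear combination of the features, passed through the logistic function
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# hypothetical weights for two features, e.g. counts of "great" and "terrible"
weights = [1.2, -1.5]
bias = 0.0
print(predict_proba([2, 0], weights, bias))  # leans towards the positive class
print(predict_proba([0, 2], weights, bias))  # leans towards the negative class
```

Training a logistic regression model amounts to choosing the weights and bias that make these probabilities match the observed labels as closely as possible.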

<img src='LogReg_1.png'>

In [None]:
# create a DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# define X and y
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp_best_worst[feature_cols]
y = yelp_best_worst.stars

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# use CountVectorizer with text column only
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)
print(X_train_dtm.shape)
print(X_test_dtm.shape)

In [None]:
# shape of other four feature columns
X_train.drop('text', axis=1).shape

<img src='sparse.jpg'><img src='csr.svg'>

In [None]:
# cast other feature columns to float and convert to a sparse matrix
extra = sp.sparse.csr_matrix(X_train.drop('text', axis=1).astype(float)) #https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.csr_matrix.html
extra.shape

In [None]:
# combine sparse matrices
X_train_dtm_extra = sp.sparse.hstack((X_train_dtm, extra)) #Stack arrays in sequence horizontally (column wise).
X_train_dtm_extra.shape


In [None]:
# repeat for testing set
extra = sp.sparse.csr_matrix(X_test.drop('text', axis=1).astype(float))
X_test_dtm_extra = sp.sparse.hstack((X_test_dtm, extra))
X_test_dtm_extra.shape


In [None]:
# use logistic regression with text column only
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# use logistic regression with all features
logreg = LogisticRegression()
logreg.fit(X_train_dtm_extra, y_train)
y_pred_class = logreg.predict(X_test_dtm_extra)
print(metrics.accuracy_score(y_test, y_pred_class))


<img src='keepcalm.png'>