# Topics

- Lemmatization
- Stemming
- Count Vectorizer

- Twitter data based sentiment analysis

# Stemming

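- The Porter stemmer is a 5-step algorithm
- Driven by suffix-stripping rules, not by a dictionary
- Fast and easy to implement
- Tries to reduce each word to its root (stem)


- In some cases the output may not be an actual word (e.g. 'troubl')
- If you are working on a very domain-specific problem, you may need your own rules

Stemmers (e.g. Porter vs. Lancaster) differ in:

- The time period in which they were implemented
- Quality: one might be slightly better than the other
- Language support
- Rules, which might differ from one algorithm to another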
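To make "driven by rules" concrete, here is a minimal sketch of a suffix-stripping stemmer. It is not the Porter algorithm; the rule table and the `toy_stem` helper are assumptions made up for illustration, and real stemmers apply many more rules and conditions.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just the same idea
# of applying ordered rewrite rules without consulting a dictionary.
SUFFIX_RULES = [      # (suffix, replacement), checked in order
    ("ies", "i"),
    ("sses", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([toy_stem(w) for w in ["cats", "going", "troubling", "went"]])
```

Because the rules only look at spelling, an irregular form such as 'went' passes through unchanged, and a stem such as 'troubl' is not a dictionary word.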

In [1]:
from nltk.stem import PorterStemmer, LancasterStemmer

In [2]:
ps = PorterStemmer()
ls = LancasterStemmer()

In [3]:
words_example = ['cats','trouble','troubling','go','goes','going','went']

In [4]:
[ps.stem(each) for each in words_example]

['cat', 'troubl', 'troubl', 'go', 'goe', 'go', 'went']

In [5]:
# TODO: repeat with the Lancaster stemmer, e.g. [ls.stem(each) for each in words_example]

# Lemmatization

- Lemma: the dictionary (base) form of a word

```'go':['going','gone']```

go
- go ing
- go ne
- went

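- Always provide the part of speech (POS); the default is noun
- Not as easy or simple to implement as stemming
- Relies on an English corpus (WordNet)
- The output is a real English word (unknown words are returned unchanged)

- Data which is just English: lemmatization
- Other languages: research which stemmers / lemmatizers are available

- Domain-specific problem (e.g. tokens like IP682): need to write your own algorithm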
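The reason the part of speech matters can be sketched with a lookup table keyed by (word, POS). The `TOY_LEXICON` table and `toy_lemmatize` helper below are hypothetical and exist only for illustration; a real lemmatizer holds a full lexicon plus morphology rules.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
# A real lemmatizer such as WordNet's holds far more entries plus morphology rules.
TOY_LEXICON = {
    ("going", "v"): "go",
    ("gone", "v"): "go",
    ("went", "v"): "go",
    ("goes", "v"): "go",
    ("cats", "n"): "cat",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print([toy_lemmatize(w, pos="v") for w in ["go", "going", "gone", "went"]])
print([toy_lemmatize(w) for w in ["go", "going", "gone", "went"]])
```

With `pos="v"` every form maps to 'go'; with the default noun POS the verb entries are never consulted, so the words come back unchanged.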

In [6]:
from nltk.stem import WordNetLemmatizer

In [7]:
lm = WordNetLemmatizer()

In [8]:
words_example

['cats', 'trouble', 'troubling', 'go', 'goes', 'going', 'went']

In [9]:
for each in words_example:
    print(lm.lemmatize(each))

cat
trouble
troubling
go
go
going
went


- POS: passing the part of speech changes the result

In [10]:
another_words = ['go','going','gone','went']

In [11]:
[lm.lemmatize(each, pos='v') for each in another_words]

['go', 'go', 'go', 'go']

In [12]:
[lm.lemmatize(each) for each in another_words]

['go', 'going', 'gone', 'went']

In [13]:
# TODO: create a list of noun words
# and try to lemmatize it

Dataset: https://www.kaggle.com/c/twitter-sentiment-analysis2/data

# Sentiment Analysis - Using Twitter

In [14]:
import pandas as pd

In [15]:
train = pd.read_csv('train.csv', encoding = "ISO-8859-1")

In [16]:
train.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...


- encoding

- How characters are mapped to bytes that a computer can store
- ASCII, Unicode (UTF-8), ISO-8859-1
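A small sketch of why the encoding argument matters: the same text becomes different bytes under different encodings, and decoding with the wrong one either fails or garbles the text. The sample string is an assumption chosen for illustration.

```python
text = "café"
latin1_bytes = text.encode("iso-8859-1")   # 4 bytes: one byte per character
utf8_bytes = text.encode("utf-8")          # 5 bytes: the accented letter takes two

# Decoding with the wrong codec either fails or silently produces mojibake.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

mojibake = utf8_bytes.decode("iso-8859-1")  # two wrong characters replace the accented letter
print(len(latin1_bytes), len(utf8_bytes), decoded_ok, mojibake)
```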

## pull out some numbers

In [17]:
# number of observations in each class
train.Sentiment.value_counts()

1    56457
0    43532
Name: Sentiment, dtype: int64

In [18]:
train[train.Sentiment == 0].shape

(43532, 3)

In [19]:
train[train.Sentiment == 1].shape

(56457, 3)
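The same counts can be turned into class shares to see how balanced the data is. This is a plain-Python sketch that copies the numbers from the output above instead of reading the DataFrame.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 56457, 0: 43532}
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
print(total, {label: round(share, 3) for label, share in shares.items()})
```

Roughly 56% of the tweets are labelled 1, so the two classes are not far from balanced.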

# Sentiment Analysis - Algorithm

- Spam or not spam
- spam: [word1, word2, word3]
- not_spam: [word1, word2, word3]

- Go through the email word by word and count:
  - the number of words found in the spam list
  - the number of words found in the not_spam list


- Domain-specific problem
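The word-list idea above can be sketched as follows. The two word sets and the `classify` helper are hypothetical; in practice the lists would be learned from labelled data, and ties are resolved as not_spam here purely by assumption.

```python
# Hypothetical word lists; in practice they would be learned from labelled data.
SPAM_WORDS = {"free", "winner", "prize"}
NOT_SPAM_WORDS = {"meeting", "report", "thanks"}

def classify(message):
    """Label a message by which word list matches more of its tokens."""
    tokens = message.lower().split()
    spam_hits = sum(token in SPAM_WORDS for token in tokens)
    ham_hits = sum(token in NOT_SPAM_WORDS for token in tokens)
    return "spam" if spam_hits > ham_hits else "not_spam"

print(classify("You are a winner claim your free prize"))
print(classify("Thanks for the report"))
```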

### Most common words in our positive tweets

In [20]:
from nltk.tokenize import TweetTokenizer
tk = TweetTokenizer()

In [21]:
from nltk.probability import FreqDist

In [22]:
all_positives = train[train.Sentiment == 1] # filtering our data

In [23]:
all_positives.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
2,3,1,omg its already 7:30 :O
6,7,1,Juuuuuuuuuuuuuuuuussssst Chillin!!
8,9,1,handed in my uniform today . i miss you ...
9,10,1,hmmmm.... i wonder how she my number @-)
11,12,1,thanks to all the haters up in my face a...


In [24]:
whole_text = ' '.join(all_positives.SentimentText.values)

In [25]:
pos_tokens = tk.tokenize(whole_text)

In [26]:
from nltk.corpus import stopwords
stop_word = set(stopwords.words('english'))  # a set makes membership checks fast

In [27]:
pos_token_wo_sp = [each for each in pos_tokens if each.lower() not in stop_word]

In [28]:
fdist = FreqDist(pos_tokens)  # counts still include stop words and punctuation

In [29]:
fdist.most_common(30)

[('!', 34630),
 ('.', 26594),
 (',', 19661),
 ('you', 16316),
 ('the', 16192),
 ('I', 15820),
 ('to', 14840),
 ('a', 12733),
 ('?', 12692),
 ('it', 8624),
 ('...', 8361),
 ('and', 8356),
 ('for', 7596),
 ('i', 6745),
 ('is', 6362),
 ('"', 6207),
 ('in', 6083),
 ('of', 5856),
 ('my', 5691),
 ('that', 5656),
 ('on', 5253),
 ('me', 4946),
 ('have', 4536),
 ('-', 4532),
 ('your', 4229),
 ('be', 3966),
 ('so', 3952),
 ('are', 3703),
 ('..', 3428),
 ('good', 3381)]

In [30]:
fdist_wo_sp = FreqDist(pos_token_wo_sp)
fdist_wo_sp.most_common(30)

[('!', 34630),
 ('.', 26594),
 (',', 19661),
 ('?', 12692),
 ('...', 8361),
 ('"', 6207),
 ('-', 4532),
 ('..', 3428),
 ('good', 3381),
 ("I'm", 3282),
 ('like', 3049),
 ('love', 2833),
 ('u', 2641),
 ('know', 2389),
 ('get', 2365),
 ('*', 2315),
 ('lol', 2050),
 ('thanks', 2007),
 ('one', 1907),
 ('day', 1854),
 (':', 1822),
 ('&', 1813),
 (')', 1797),
 ('see', 1747),
 ("'", 1738),
 ('(', 1672),
 ('time', 1623),
 ('well', 1594),
 ('haha', 1593),
 ('think', 1550)]
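Even after stop-word removal the counts above are still dominated by punctuation. One possible way to drop those tokens, shown on a hypothetical token list standing in for `pos_token_wo_sp`, is to keep only tokens that contain at least one letter or digit. `collections.Counter` is used here so the sketch runs without NLTK.

```python
from collections import Counter

# Hypothetical token list standing in for pos_token_wo_sp.
tokens = ["good", "!", "love", "...", "good", ":", "I'm", "!", "lol"]

# Keep tokens that contain at least one letter or digit.
word_tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(Counter(word_tokens).most_common(2))
```

Whether emoticons and similar tokens should also be dropped depends on the task, since they can carry sentiment.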

### Most common words in our negative tweets
- todo

## Count Vectorizer

In [31]:
'some text like this',
'another text',
'some more text'

'some more text'

```
row 1 - 1 0 12 14 15  1
row 2 - 1 0 14 17 18  0
row 3 - 1 0 1 0 2 1   1
row 4 - 1 0 1 0 8 1   0
```
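The idea of turning text into rows of counts can be sketched by hand for the three example strings. This is an illustration in plain Python under simple assumptions (whitespace tokenization, alphabetical column order); it is not how scikit-learn implements it.

```python
docs = ["some text like this", "another text", "some more text"]

# Vocabulary: every distinct token, sorted so column order is reproducible.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word.
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```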

In [32]:
corpus = ['This is the first, document first document',
          'This document is the second document',
          'And this is the third one.',
          'Is this is the first document?']

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
vectorizer = CountVectorizer(ngram_range=(1,2))

In [35]:
X = vectorizer.fit_transform(corpus)

In [36]:
X

<4x22 sparse matrix of type '<class 'numpy.int64'>'
	with 41 stored elements in Compressed Sparse Row format>

In [37]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(vectorizer.get_feature_names())
print(X.toarray())

['and', 'and this', 'document', 'document first', 'document is', 'first', 'first document', 'is', 'is the', 'is this', 'one', 'second', 'second document', 'the', 'the first', 'the second', 'the third', 'third', 'third one', 'this', 'this document', 'this is']
[[0 0 2 1 0 2 2 1 1 0 0 0 0 1 1 0 0 0 0 1 0 1]
 [0 0 2 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 1 1 0]
 [1 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1]
 [0 0 1 0 0 1 1 2 1 1 0 0 0 1 1 0 0 0 0 1 0 1]]


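- lowercase - convert text to lowercase before tokenizing (default True)
- stop_words - ```'english', []``` (built-in English list or a custom list)
- max_df - upper bound on document frequency; more frequent terms are ignored
- min_df - lower bound on document frequency; rarer terms are ignored
- preprocessor - a function applied to each raw document before tokenizing; stemming or lemmatization can be plugged in here or via tokenizer
- binary - 0 or 1 instead of count
- ngram_range=(1,3) - unigrams, bigrams and trigrams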
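What `ngram_range` asks for can be sketched in plain Python. The `word_ngrams` helper is hypothetical and only mirrors the idea of collecting every run of n consecutive tokens; it is not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of length min_n..max_n as space-joined strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "and this is the third one".split()
print(word_ngrams(tokens, 1, 2))
```

Widening the range to (1, 3) adds the trigrams as well, which is why the vocabulary grows quickly with larger ranges.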

## First Sentiment Analysis Model

In [38]:
train.SentimentText

0                             is so sad for my APL frie...
1                           I missed the New Moon trail...
2                                  omg its already 7:30 :O
3                  .. Omgaga. Im sooo  im gunna CRy. I'...
4                 i think mi bf is cheating on me!!!   ...
                               ...                        
99984    @Cupcake  seems like a repeating problem   hop...
99985    @cupcake__ arrrr we both replied to each other...
99986                       @CuPcAkE_2120 ya i thought so 
99987    @Cupcake_Dollie Yes. Yes. I'm glad you had mor...
99988                      @cupcake_kayla haha yes you do 
Name: SentimentText, Length: 99989, dtype: object

In [39]:
count_vect = CountVectorizer(stop_words='english',
                             ngram_range=(1,2))
X_train_counts = count_vect.fit_transform(train.SentimentText)

In [40]:
X_train_counts

<99989x568934 sparse matrix of type '<class 'numpy.int64'>'
	with 1315619 stored elements in Compressed Sparse Row format>

In [41]:
# MNB = Multinomial Naive Bayes (TODO: revise MNB)
from sklearn.naive_bayes import MultinomialNB

In [42]:
model = MultinomialNB().fit(X_train_counts, train.Sentiment)

In [43]:
docs_new = ['Love is beautiful','the service sucks','an aweful day',
           'the chocolate was amazing']

In [44]:
X_new_counts = count_vect.transform(docs_new)

In [47]:
X_new_counts

<4x568934 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [45]:
predicted = model.predict(X_new_counts) 

In [46]:
predicted

array([1, 0, 0, 1])
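As a revision aid for MNB, here is a minimal from-scratch sketch of multinomial naive Bayes on word counts with add-one smoothing. The helper names and the toy training set are assumptions for illustration; scikit-learn's `MultinomialNB` is the implementation used above, and this sketch only mirrors the idea.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial naive Bayes with add-alpha (Laplace) smoothing."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab, alpha

def predict_mnb(model, doc):
    """Return the label with the highest log posterior for doc."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)   # log prior
        for word in doc.lower().split():
            if word in vocab:                                 # skip unseen words
                score += math.log((word_counts[label][word] + alpha)
                                  / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set: 1 = positive, 0 = negative.
docs = ["love this", "amazing day", "service sucks", "awful day"]
labels = [1, 1, 0, 0]
model = train_mnb(docs, labels)
print([predict_mnb(model, d) for d in ["love amazing", "sucks awful"]])
```

Words never seen in training are skipped, which matches what happens when a fitted vectorizer transforms new text: out-of-vocabulary words simply get no column.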