
# NLP Practice

The file `stockerbot-export.csv` contains around 28,000 tweets about financial companies, each labelled with the company it is about.

Your task is to build a model that correctly identifies the company each tweet is about.


## 1. Open the dataframe

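Hint: the data may need cleaning. Some rows are malformed; skip them by setting the `error_bad_lines` flag to `False` (in pandas 1.3 and later, use the `on_bad_lines` argument instead).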

In [45]:
import pandas as pd

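# skip malformed rows and report them; on_bad_lines replaces error_bad_lines=False from pandas 1.3 onwards
df = pd.read_csv("../assets/data/stockerbot-export.csv", on_bad_lines="warn")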
print(df.shape)
df.head(5)

(28264, 8)


Skipping line 731: expected 8 fields, saw 13
Skipping line 2836: expected 8 fields, saw 15
Skipping line 3058: expected 8 fields, saw 12
Skipping line 3113: expected 8 fields, saw 12
Skipping line 3194: expected 8 fields, saw 17
Skipping line 3205: expected 8 fields, saw 17
Skipping line 3255: expected 8 fields, saw 17
Skipping line 3520: expected 8 fields, saw 17
Skipping line 4078: expected 8 fields, saw 17
Skipping line 4087: expected 8 fields, saw 17
Skipping line 4088: expected 8 fields, saw 17
Skipping line 4499: expected 8 fields, saw 12


| | id | text | timestamp | source | symbols | company_names | url | verified |
|---|---|---|---|---|---|---|---|---|
| 0 | 1019696670777503700 | VIDEO: “I was in my office. I was minding my o... | Wed Jul 18 21:33:26 +0000 2018 | GoldmanSachs | GS | The Goldman Sachs | https://twitter.com/i/web/status/1019696670777... | True |
| 1 | 1019709091038548000 | The price of lumber $LB_F is down 22% since hi... | Wed Jul 18 22:22:47 +0000 2018 | StockTwits | M | Macy's | https://twitter.com/i/web/status/1019709091038... | True |
| 2 | 1019711413798035500 | Who says the American Dream is dead? https://t... | Wed Jul 18 22:32:01 +0000 2018 | TheStreet | AIG | American | https://buff.ly/2L3kmc4 | True |
| 3 | 1019716662587740200 | Barry Silbert is extremely optimistic on bitco... | Wed Jul 18 22:52:52 +0000 2018 | MarketWatch | BTC | Bitcoin | https://twitter.com/i/web/status/1019716662587... | True |
| 4 | 1019718460287389700 | How satellites avoid attacks and space junk wh... | Wed Jul 18 23:00:01 +0000 2018 | Forbes | ORCL | Oracle | http://on.forbes.com/6013DqDDU | True |
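The `Skipping line` messages report rows whose field count differs from the 8 columns in the header, typically because a tweet contains unquoted commas. The following is a minimal sketch of that check using only the standard library; the three-column sample data is made up for illustration and is not taken from the real file.

```python
import csv
import io

# Made-up sample: the header has 3 fields, the second data row has 5
raw = io.StringIO(
    "id,text,company\n"
    "1,Goldman hires interns,GS\n"
    "2,Lumber is down, Macy's is up, again,M\n"
    "3,Oracle launches satellite,ORCL\n"
)

reader = csv.reader(raw)
header = next(reader)
good_rows, bad_lines = [], []
for line_no, row in enumerate(reader, start=2):  # line 1 is the header
    if len(row) == len(header):
        good_rows.append(row)
    else:
        # record the file line and how many fields were seen
        bad_lines.append((line_no, len(row)))

print(bad_lines)  # [(3, 5)]
```

Dropping such rows is the simplest option; the skipped lines are a small fraction of the roughly 28,000 rows here.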


## 2. Check the columns of the dataframe

Identify the column related to tweets and the one related to the companies.

In [46]:
df.columns

Index(['id', 'text', 'timestamp', 'source', 'symbols', 'company_names', 'url',
       'verified'],
      dtype='object')

## 3. Transform the tweets into `computer-friendly` data

Think of the 3 steps to clean documents seen in the lesson.

In [47]:
reviews=df["text"].values
company_names=df["company_names"]
print(reviews)
print(company_names[0])

['VIDEO: “I was in my office. I was minding my own business...” –David Solomon tells $GS interns how he learned he wa… https://t.co/QClAITywXV'
 "The price of lumber $LB_F is down 22% since hitting its YTD highs. The Macy's $M turnaround is still happening.… https://t.co/XnKsV4De39"
 'Who says the American Dream is dead? https://t.co/CRgx19x7sA' ...
 "RT @invest_in_hd: 'Nuff said!  $TEL #telcoin #Telfam #crypto #Blockchain #ethereum #bitcoin $BTC $ETH https://t.co/dkRvaYzgcd"
 '【仮想通貨】ビットコインの価格上昇、８０万円台回復\u3000約１カ月半ぶり\u3000\u3000\u3000\u3000\u3000\u3000$BTC https://t.co/1OaM6ANOLX https://t.co/Ezd82kCt9L'
 'Stellar $XLM price: $0.297852 Binance registration is now OPEN for limited time! 💸 💰  ➡️… https://t.co/TteerEnNjo']
The Goldman Sachs


In [48]:
type(reviews)

numpy.ndarray

In [49]:
# lowercasing, demonstrated on the last tweet
words = reviews[-1].split()
lower_words = [w.lower() for w in words]
lower_words

['stellar',
 '$xlm',
 'price:',
 '$0.297852',
 'binance',
 'registration',
 'is',
 'now',
 'open',
 'for',
 'limited',
 'time!',
 '💸',
 '💰',
 '➡️…',
 'https://t.co/tteerennjo']

In [50]:
from nltk.corpus import stopwords as nltk_stopwords

stopwords = nltk_stopwords.words('english')
print(len(stopwords))
stopwords[:10]

179


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [51]:
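# `reviews` holds whole tweets, not words: no tweet equals a stop word, so both counts match
print("number of unique tweets", len(set(reviews)))
kept_tweets = [tweet for tweet in reviews if tweet not in stopwords]
print("number of unique tweets, excluding stopwords", len(set(kept_tweets)))
# stop words have to be filtered token by token; shown here on the first tweet
useful_words = [w for w in reviews[0].lower().split() if w not in stopwords]

number of unique tweets 25685
number of unique tweets, excluding stopwords 25685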


In [33]:
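#stemming
# assumption: the cell that defined `stemmer` is not shown; an English Snowball stemmer is used here
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
stemmed_words = [stemmer.stem(word) for word in useful_words]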
print(useful_words[:10])
print(stemmed_words[:10])

['video:', '“i', 'office.', 'minding', 'business...”', '–david', 'solomon', 'tells', '$gs', 'interns']
['video:', '“i', 'office.', 'mind', 'business...”', '–david', 'solomon', 'tell', '$gs', 'intern']


In [34]:
#lemmatisation
from nltk import WordNetLemmatizer

lem = WordNetLemmatizer()
lemmatised_words = [lem.lemmatize(word, 'v') for word in useful_words]
print(useful_words[:10])
print(stemmed_words[:10])
print(lemmatised_words[:10])

['video:', '“i', 'office.', 'minding', 'business...”', '–david', 'solomon', 'tells', '$gs', 'interns']
['video:', '“i', 'office.', 'mind', 'business...”', '–david', 'solomon', 'tell', '$gs', 'intern']
['video:', '“i', 'office.', 'mind', 'business...”', '–david', 'solomon', 'tell', '$gs', 'intern']


In [57]:
#apply stemming
df["text_stemmed"] = df["text"].apply(lambda x: " ".join([stemmer.stem(w) for w in x.split()]))
print(df["text"].values[0])
print(df["text_stemmed"].values[0])

VIDEO: “I was in my office. I was minding my own business...” –David Solomon tells $GS interns how he learned he wa… https://t.co/QClAITywXV
video: “i was in my office. i was mind my own business...” –david solomon tell $gs intern how he learn he wa… https://t.co/qclaitywxv
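Splitting on whitespace leaves punctuation attached (`office.`, `video:`) and keeps URLs as tokens, so identical words can end up as different features. A possible way to collect the cleaning steps into one function is sketched below. The regular expressions, the tiny stop-word set and the function name `clean_tweet` are illustrative assumptions; in the notebook the NLTK stop-word list and `stemmer.stem` could be passed in instead.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# a token is an optional leading "$" (cashtag) followed by letters, digits or "_"
TOKEN_RE = re.compile(r"\$?[a-z][a-z0-9_]*")

# Tiny illustrative stop-word set (assumption); swap in the NLTK list in practice
STOPWORDS = {"i", "was", "in", "my", "own", "he", "how", "the", "is"}


def clean_tweet(text, stopwords=STOPWORDS, stem=lambda w: w):
    """Lowercase, drop URLs and punctuation, remove stop words, then stem."""
    text = URL_RE.sub(" ", text.lower())
    tokens = TOKEN_RE.findall(text)
    return [stem(t) for t in tokens if t not in stopwords]


tweet = "VIDEO: I was in my office. David Solomon tells $GS interns https://t.co/QClAITywXV"
tokens = clean_tweet(tweet)
print(tokens)
# ['video', 'office', 'david', 'solomon', 'tells', '$gs', 'interns']
```

The default `stem` is the identity function, so the sketch runs without NLTK; cashtags such as `$gs` are kept intact because they are strong hints about the company.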


## 4. Split the dataset into train and test sets

In [58]:
from sklearn.model_selection import train_test_split

X = df["text_stemmed"]
y = df["company_names"]

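# no `stratify` argument is passed, so company proportions are only roughly preserved;
# `stratify=y` would keep them, but it needs at least two tweets per company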
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
y.value_counts()


Twenty-First Century Fox                  131
Alphabet Inc.                             116
Discovery                                 102
Netflix                                   101
Momo Inc.                                 100
Eversource Energy                         100
The Gap                                   100
M&T Bank Corporation                      100
Honeywell International Inc.              100
Applied Materials                         100
Masco Corporation                          99
Essex Property Trust                       99
Groupon                                    99
Mohawk Industries                          99
BlackRock                                  97
TE Connectivity Ltd.                       97
Ingersoll-Rand Plc                         97
Hilton Worldwide Holdings Inc.             97
United Parcel Service                      97
Dominion Energy                            96
Nutanix                                    96
Discover Financial Services       
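With this many classes and only about 100 tweets each, a purely random split can leave some companies under-represented in one of the two sets. The sketch below shows what passing `stratify` to `train_test_split` does on made-up labels; the toy data and class names are assumptions for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy labels: 10 x A, 6 x B, 4 x C
texts = [f"tweet {i}" for i in range(20)]
labels = ["A"] * 10 + ["B"] * 6 + ["C"] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Each half receives the same class mix as the full data set
print(Counter(y_tr))
print(Counter(y_te))
```

Stratification raises an error when a class has a single member, so rare companies would have to be dropped or merged before using it.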

## 5. The additional step is to transform the tweets into vectors.



In [59]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True,
                      stop_words='english',
                      lowercase=True # default
                     )

# turn the training tweets into a bag of words: one column per vocabulary term;
# with binary=True each entry records presence (0/1) rather than a word count
X_train_text = vec.fit_transform(X_train)
X_test_text = vec.transform(X_test)

print(len(vec.vocabulary_))
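# look at a slice of the alphabetically sorted features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(vec.get_feature_names_out()[1000:1010].tolist())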

33453
['1858', '185c', '186', '1860s', '187', '1870', '1885', '1895460', '189xwovcqu', '18byjsdc1o']
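Two details of `CountVectorizer` matter for tweets: `binary=True` clips repeated words to 1, and the default token pattern keeps only runs of two or more word characters, so `$gs` becomes `gs` and a one-letter cashtag such as `$m` disappears. The sketch below illustrates both on made-up documents; the custom `token_pattern` is one possible choice, not the notebook's setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
docs = ["buy buy buy $gs", "sell $m now"]

count_vec = CountVectorizer()             # word counts, default tokeniser
binary_vec = CountVectorizer(binary=True)  # presence/absence only

counts = count_vec.fit_transform(docs).toarray()
binary = binary_vec.fit_transform(docs).toarray()

print(sorted(count_vec.vocabulary_))  # ['buy', 'gs', 'now', 'sell']
print(counts.tolist())                # [[3, 1, 0, 0], [0, 0, 1, 1]]
print(binary.tolist())                # [[1, 1, 0, 0], [0, 0, 1, 1]]

# Assumed alternative pattern that keeps cashtags, including one-letter ones
cashtag_vec = CountVectorizer(binary=True, token_pattern=r"\$?\w+")
cashtag_vec.fit(docs)
print(sorted(cashtag_vec.vocabulary_))  # ['$gs', '$m', 'buy', 'now', 'sell']
```

The feature slice printed above also shows URL fragments such as `189xwovcqu` in the vocabulary, which suggests removing URLs before vectorising.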


## 6. Use one classification technique to correctly flag the tweets

Hint: this is a multiclass classification problem, not a binary one.

In [60]:
# in order to use LogisticRegression we must have numerical values as X_train
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_text, y_train)
#print(lr.intercept_, lr.coef_)

[-5.99629925 -5.80537986 -5.89020343 -5.36862796 -5.55869872 -5.83488845
 -5.75131165 -5.9508458  -5.60477304 -5.82671837 -5.61977111 -5.70533605
 -5.88248528 -4.57143688 -5.84891383 -5.80149734 -5.68052632 -5.89436081
 -5.77734259 -5.51734837 -5.70119809 -5.71883307 -5.94300933 -5.96017828
 -5.94591003 -6.01934874 -5.43714227 -5.95453348 -5.79116069 -6.01774532
 -5.82559347 -5.84026459 -5.98784609 -5.74422044 -5.94672844 -5.81575191
 -5.97584624 -5.96988129 -6.09942248 -5.86019356 -5.58279181 -5.85611472
 -5.77457873 -5.74983554 -5.91532381 -5.83475538 -5.98596459 -5.62426399
 -5.85926158 -5.49272488 -5.89653873 -5.95047416 -5.81859521 -5.9161285
 -5.73662923 -4.9287544  -5.84466931 -5.93652577 -5.78244633 -5.931577
 -5.89514568 -5.90312102 -5.80298059 -5.99135409 -5.74962481 -5.53887399
 -5.88291249 -5.76318055 -5.86330825 -5.74142157 -5.90431826 -5.53240882
 -5.65602158 -5.62347862 -5.70761595 -5.5392319  -5.69055534 -5.79927854
 -5.86413204 -5.82120266 -6.01075297 -6.02775521 -5.72
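`LogisticRegression` handles more than two classes without extra configuration: it learns one intercept and one coefficient row per class, which is why the intercept array above has one entry per company. The sketch below shows this on a made-up three-company corpus; the documents and the `max_iter` value are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training documents, two per company
docs = [
    "goldman sachs interns", "goldman sachs bank",
    "macy turnaround retail", "macy store retail",
    "bitcoin price crypto", "bitcoin optimistic crypto",
]
labels = ["The Goldman Sachs"] * 2 + ["Macy's"] * 2 + ["Bitcoin"] * 2

toy_vec = CountVectorizer(binary=True)
X_toy = toy_vec.fit_transform(docs)

clf = LogisticRegression(max_iter=1000).fit(X_toy, labels)

# One probability per class for each input document
proba = clf.predict_proba(toy_vec.transform(["goldman bank"]))
print(clf.classes_)          # class labels in sorted order
print(proba.shape)           # (1, 3)
print(clf.intercept_.shape)  # (3,)
print(clf.coef_.shape)       # (3, number of vocabulary terms)
```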

## 7. Measure the accuracy of your model
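Accuracy is the share of test tweets whose predicted company equals the true one. A minimal sketch with made-up labels follows; in the notebook, `y_true` would be `y_test` and `y_pred` would be `lr.predict(X_test_text)`. Comparing against a majority-class baseline helps judge whether a score is actually good.

```python
from collections import Counter

from sklearn.metrics import accuracy_score

# Toy labels (made up); stand-ins for y_test and lr.predict(X_test_text)
y_true = ["The Goldman Sachs", "The Goldman Sachs", "Macy's", "Oracle"]
y_pred = ["The Goldman Sachs", "Macy's", "Macy's", "Oracle"]

accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the most frequent company in the test labels
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline = majority_count / len(y_true)

print(accuracy, baseline)  # 0.75 0.5
```

With companies capped at roughly 100 to 130 tweets each out of about 28,000, the majority-class baseline on the real data is well below 1%, so even a modest accuracy is far better than guessing.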

## EXTRA: why is the accuracy so low? What can you do to improve it?

Hint: try to segment your dataset into groups and apply different models to different groups.
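One possible reading of the hint, offered as a sketch rather than the intended solution: split tweets into those that mention exactly one known cashtag, which can be resolved by lookup, and the rest, which go to a trained model. The `symbol_to_company` mapping, the regular expression and the function name `predict_company` are hypothetical; in the notebook the mapping could be built from the `symbols` and `company_names` columns.

```python
import re

# "$" followed by 1-6 letters and a word boundary; "$LB_F" and prices like "$0.29" do not match
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,6})\b")

# Hypothetical lookup table for illustration
symbol_to_company = {"GS": "The Goldman Sachs", "M": "Macy's", "BTC": "Bitcoin"}


def predict_company(text, fallback):
    """Resolve tweets with exactly one known cashtag by lookup, else use the model."""
    tags = {t.upper() for t in CASHTAG_RE.findall(text)}
    known = [t for t in tags if t in symbol_to_company]
    if len(known) == 1:
        return symbol_to_company[known[0]]
    return fallback(text)


def model_stub(text):
    # stand-in for a trained classifier such as lr.predict
    return "model prediction"


print(predict_company("David Solomon tells $GS interns", model_stub))
print(predict_company("Lumber $LB_F is down. The Macy's $M turnaround", model_stub))
print(predict_company("Who says the American Dream is dead?", model_stub))
```

Tweets with several known cashtags, or none, fall through to the model, so the two groups can be evaluated separately to see where the errors concentrate.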