# Challenge 2: Sentiment Analysis

In this challenge, we will learn about sentiment analysis and practice it on tweets from Twitter.

## Introduction

Sentiment analysis aims to *systematically identify, extract, quantify, and study affective states and subjective information* in text ([reference](https://en.wikipedia.org/wiki/Sentiment_analysis)). Put simply, it tries to determine whether the author of a piece of text was happy or unhappy when writing it. Why do we (or rather, companies) care about sentiment in text? Because understanding it tells us whether customers are happy or unhappy with our products and services. If they are unhappy, the next step is to figure out what caused the unhappiness and make improvements.

Basic sentiment analysis only detects the *positive* or *negative* (sometimes also *neutral*) polarity of a text. More advanced sentiment analysis also considers dimensions such as agreement, subjectivity, confidence, irony, and so on. In this challenge we will conduct basic positive vs. negative sentiment analysis on real tweets.

NLTK comes with a [sentiment analysis package](https://www.nltk.org/api/nltk.sentiment.html). This package is a convenient starting point because it needs only the text itself, with no labeled training data, to make predictions. For example:

```python
>>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
>>> txt = "Ironhack is a Global Tech School ranked num 2 worldwide.   Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do."
>>> analyzer = SentimentIntensityAnalyzer()
>>> analyzer.polarity_scores(txt)
{'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.8442}
```
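The `compound` value in that output is a single normalized score between -1 (most negative) and +1 (most positive). The sketch below shows one way to turn it into a coarse label. The helper name `label_from_compound` is hypothetical, and the ±0.05 cut-off is a commonly used convention assumed here, not something this lab prescribes.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score to a coarse sentiment label.

    The 0.05 threshold is an assumed convention, not part of this lab.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores copied from the example output above.
scores = {'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.8442}
label = label_from_compound(scores['compound'])
print(label)
```

Scores close to zero fall into the neutral band, so widening or narrowing `threshold` changes how many texts are treated as neutral.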

In this challenge, however, you will not use NLTK's sentiment analysis package because in your Machine Learning training in the past 2 weeks you have learned how to make predictions more accurate than that. The [tweets data](https://www.kaggle.com/kazanova/sentiment140) we will be using today are already coded for the positive/negative sentiment. You will be able to use the Naïve Bayes classifier you learned in the lesson to predict the sentiment of tweets based on the labels.

## Conducting Sentiment Analysis

### Loading and Exploring Data

The dataset we'll be using today is in the lab directory, in a file named `Sentiment140.csv.zip`. You need to unzip it into a `.csv` file. Then, in the cell below, load and explore the data.

*Notes:* 

* The dataset was downloaded from [Kaggle](https://www.kaggle.com/kazanova/sentiment140). We made a slight change to the original data so that each column has a label.

* The dataset is huge (1.6 million tweets). While developing your analysis code, you can work on a sample of the data (e.g. 20k records) to save a lot of time when testing.

In [5]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

# Note: this rebinds the name `stopwords` from the corpus reader to a set of
# English stop words; a set makes the membership tests below fast.
stopwords = set(stopwords.words('english'))

tweets = pd.read_csv('../data/Sentiment140.csv')

In [6]:
def clean_up(s):
    # Lowercase, then replace URLs and any non-letter characters with spaces.
    return re.sub(r"http\S+|[^a-zA-Z]", " ", s.lower())

def tokenize(s):
    return nltk.word_tokenize(s)

def stem_and_lemmatize(l):
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    # Stem each token first, then lemmatize the stem.
    return [lemmatizer.lemmatize(stemmer.stem(word)) for word in l]

def remove_stopwords(l):
    return [word for word in l if word not in stopwords]

### Prepare Textual Data for Sentiment Analysis

Now, apply the functions you have written in Challenge 1 to your whole data set. These functions include:

* `clean_up()`

* `tokenize()`

* `stem_and_lemmatize()`

* `remove_stopwords()`

Create a new column called `text_processed` in the dataframe to hold the processed data. At the end, your `text_processed` column should contain lists of cleaned-up word tokens. Your data should look like the following:

![Processed Data](../images/data-cleaning-results.png)
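To make the expected shape of `text_processed` concrete, here is a dependency-free sketch of cleaning, tokenizing, and stop-word removal applied per tweet. It is an illustration only: the `preprocess` helper, the tiny `DEMO_STOPWORDS` set, and the sample tweets are hypothetical; whitespace splitting stands in for `nltk.word_tokenize`; and stemming and lemmatization are left out.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration;
# the lab itself uses NLTK's English stop words.
DEMO_STOPWORDS = {"i", "the", "a", "is", "to", "my", "so"}


def preprocess(text):
    """Clean, tokenize, and filter one tweet into a list of tokens."""
    # Lowercase, then replace URLs and non-letter characters with spaces.
    cleaned = re.sub(r"http\S+|[^a-z]", " ", text.lower())
    # Whitespace split stands in for nltk.word_tokenize in this sketch.
    tokens = cleaned.split()
    # Stemming and lemmatization are omitted to keep the sketch dependency-free.
    return [tok for tok in tokens if tok not in DEMO_STOPWORDS]


tweets_demo = [
    "I LOVE the new phone!!! http://example.com",
    "My flight is delayed again... so sad",
]
processed = [preprocess(t) for t in tweets_demo]
print(processed)
```

Each tweet becomes one list of tokens, which is the structure the later steps expect in every row of `text_processed`.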

In [7]:
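# Work on a 20k-row random sample to keep processing time manageable.
sample = tweets.sample(20000)

def bag_of_words(string):
    # Run the full Challenge 1 pipeline on one tweet and return its tokens.
    string = clean_up(string)
    string = tokenize(string)
    string = stem_and_lemmatize(string)
    string = remove_stopwords(string)
    return string

sample['text_processed'] = sample['text'].apply(bag_of_words)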


In [8]:
words = [word for lst in sample['text_processed'] for word in lst]

### Creating Bag of Words

The purpose of this step is to create a [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) from the processed data. The bag of words contains all the unique words in your whole text body (a.k.a. *corpus*) together with the number of occurrences of each word. It lets you see which words are the most important features across the whole corpus.

As you can imagine, this produces a massive set of words. The less important words (i.e. those that occur very rarely) do not contribute much to the sentiment. Therefore, you only need the most important words to build your feature set in the next step. In our case, we will use the 5,000 most frequent words to build the features.

In the cell below, combine all the words in `text_processed` and calculate the frequency distribution of all words. A convenient library to calculate the term frequency distribution is NLTK's `FreqDist` class ([documentation](https://www.nltk.org/api/nltk.html#module-nltk.probability)). Then select the top 5,000 words from the frequency distribution.
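The counting and top-N selection described above can be sketched with the standard library alone. This example uses `collections.Counter` on a hypothetical toy corpus; NLTK's `FreqDist` offers the same `most_common()` method, so the last step carries over directly.

```python
from collections import Counter

# Toy corpus of already-processed tweets (hypothetical data).
processed_tweets = [
    ["love", "new", "phone"],
    ["love", "day"],
    ["sad", "day"],
    ["love", "rain"],
]

# Flatten every token list into one long list of words.
all_words = [word for tokens in processed_tweets for word in tokens]

# Count occurrences of each unique word.
freq = Counter(all_words)

# Keep only the N most frequent words (N would be 5000 in the lab).
top_n = 2
top_words = [word for word, count in freq.most_common(top_n)]
print(top_words)
```

Note that `most_common()` sorts by count, whereas the keys of a frequency distribution are kept in the order the words were first seen, so slicing the keys does not give the most frequent words.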

In [14]:
from nltk.probability import FreqDist
cfdist = FreqDist(words)
cfdist

FreqDist({'go': 1722, 'day': 1444, 'wa': 1329, 'get': 1317, 'thi': 1193, 'good': 1122, 'work': 1042, 'love': 1041, 'like': 1041, 'quot': 936, ...})

### Building Features

Now let's build the features. Using the top 5,000 words, create a 2-dimensional matrix to record whether each of those words is contained in each document (tweet). Then you also have an output column to indicate whether the sentiment in each tweet is positive. For example, assuming your bag of words has 5 items (`['one', 'two', 'three', 'four', 'five']`) out of 4 documents (`['A', 'B', 'C', 'D']`), your feature set is essentially:

| Doc | one | two | three | four | five | is_positive |
|---|---|---|---|---|---|---|
| A | True | False | False | True | False | True |
| B | False | False | False | True | True | False |
| C | False | True | False | False | False | True |
| D | True | False | False | False | True | False |

However, because the `nltk.NaiveBayesClassifier.train` method we will use in the next step does not accept a pandas DataFrame, your feature set should be converted to a Python list of `(features, label)` tuples like this:

```python
[
	({
		'one': True,
		'two': False,
		'three': False,
		'four': True,
		'five': False
	}, True),
	({
		'one': False,
		'two': False,
		'three': False,
		'four': True,
		'five': True
	}, False),
	({
		'one': False,
		'two': True,
		'three': False,
		'four': False,
		'five': False
	}, True),
	({
		'one': True,
		'two': False,
		'three': False,
		'four': False,
		'five': True
	}, False)
]
```
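A minimal sketch of producing that structure from token lists, using the five-word example from the table. The `vocabulary` and `documents` values are hypothetical stand-ins for your top words and processed tweets, and this two-argument `find_features` is written for the illustration.

```python
def find_features(tokens, vocabulary):
    """Return {word: bool} marking which vocabulary words occur in tokens."""
    token_set = set(tokens)  # set membership checks are fast
    return {word: (word in token_set) for word in vocabulary}


# Hypothetical vocabulary and labelled documents matching the table above.
vocabulary = ['one', 'two', 'three', 'four', 'five']
documents = [
    (['one', 'four'], True),    # A
    (['four', 'five'], False),  # B
    (['two'], True),            # C
    (['one', 'five'], False),   # D
]

# One (feature dict, label) tuple per document.
featuresets = [(find_features(tokens, vocabulary), is_positive)
               for tokens, is_positive in documents]
print(featuresets[0])
```

The loop runs over the vocabulary, not over every token in the corpus, so each feature dictionary has exactly one entry per vocabulary word.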

To help you in this step, watch the [following video](https://www.youtube.com/watch?v=-vVskDsHcVc) to learn how to build the feature set with Python and NLTK. The source code in this video can be found [here](https://pythonprogramming.net/words-as-features-nltk-tutorial/).

[![Building Features](../images/building-features.jpg)](https://www.youtube.com/watch?v=-vVskDsHcVc)

In [15]:
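# Keep the 5,000 most frequent words; most_common() sorts by count,
# whereas slicing keys() would only take the first 5,000 words seen.
word_features = [word for word, _ in cfdist.most_common(5000)]

def find_features(document):
    doc_words = set(document)
    features = {}
    # Iterate over the top words only, not every token in the corpus.
    for w in word_features:
        features[w] = (w in doc_words)
    return features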

In [17]:
featuresets = []
for index, row in sample.iterrows():
    featuresets.append((find_features(row['text_processed']), row['target']==4))

### Building and Training Naive Bayes Model

In this step you will split your feature set into a training set and a test set. Then you will train a Naive Bayes classifier on the training set by calling `nltk.NaiveBayesClassifier.train` ([example](https://www.nltk.org/book/ch06.html)).
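The split itself is just a shuffle followed by a slice. The sketch below does it with the standard library on hypothetical toy data; the helper name `split_featuresets`, the 25% test fraction, and the seed are assumptions made for the illustration. scikit-learn's `train_test_split` is a ready-made alternative.

```python
import random


def split_featuresets(featuresets, test_fraction=0.25, seed=42):
    """Shuffle a copy of featuresets and split it into (train, test)."""
    shuffled = list(featuresets)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy feature set: ({feature: bool}, label) pairs.
toy = [({'good': i % 2 == 0}, i % 2 == 0) for i in range(20)]
train_set, test_set = split_featuresets(toy)
print(len(train_set), len(test_set))

# With NLTK installed, training would then be:
# classifier = nltk.NaiveBayesClassifier.train(train_set)
```

Shuffling before slicing matters if the source file is ordered by label, because an unshuffled slice could then contain only one class.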

After training the model, call `classifier.show_most_informative_features()` to inspect the most important features. The output will look like:

```
Most Informative Features
	    snow = True            False : True   =     34.3 : 1.0
	  easter = True            False : True   =     26.2 : 1.0
	 headach = True            False : True   =     20.9 : 1.0
	    argh = True            False : True   =     17.6 : 1.0
	unfortun = True            False : True   =     16.9 : 1.0
	    jona = True             True : False  =     16.2 : 1.0
	     ach = True            False : True   =     14.9 : 1.0
	     sad = True            False : True   =     13.0 : 1.0
	  parent = True            False : True   =     12.9 : 1.0
	  spring = True            False : True   =     12.7 : 1.0
```

The [following video](https://www.youtube.com/watch?v=rISOsUaTrO4) will help you complete this step. The source code in this video can be found [here](https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/).

[![Building and Training NB](../images/nb-model-building.jpg)](https://www.youtube.com/watch?v=rISOsUaTrO4)

In [23]:
X_train, X_test = train_test_split(featuresets)
classifier = nltk.NaiveBayesClassifier.train(X_train)

In [24]:
classifier.show_most_informative_features()

Most Informative Features
                     sad = True            False : True   =     26.8 : 1.0
                 depress = True            False : True   =     20.0 : 1.0
                  throat = True            False : True   =     16.7 : 1.0
                 headach = True            False : True   =     16.4 : 1.0
            followfriday = True             True : False  =     14.6 : 1.0
                  cancel = True            False : True   =     14.1 : 1.0
                   sadli = True            False : True   =     14.1 : 1.0
                 terribl = True            False : True   =     13.5 : 1.0
                    hurt = True            False : True   =     13.2 : 1.0
                    wors = True            False : True   =     12.1 : 1.0


### Testing Naive Bayes Model

Now we'll test our classifier with the test dataset. This is done by calling `nltk.classify.accuracy(classifier, test)`.

As mentioned in one of the tutorial videos, a Naive Bayes model is considered OK if your accuracy score is over 0.6. If your accuracy score is over 0.7, you've done a great job!
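Accuracy here is simply the share of test examples whose predicted label matches the true label. A small sketch of that computation, using a hypothetical rule-based `predict` function in place of the trained classifier and made-up test data:

```python
def accuracy(predict, labelled_featuresets):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for features, label in labelled_featuresets
                  if predict(features) == label)
    return correct / len(labelled_featuresets)


# Hypothetical rule-based "classifier": positive unless the tweet contains 'sad'.
def predict(features):
    return not features.get('sad', False)


test_demo = [
    ({'sad': True, 'love': False}, False),
    ({'sad': False, 'love': True}, True),
    ({'sad': False, 'love': False}, False),  # misclassified by the rule
    ({'sad': False, 'love': True}, True),
]
score = accuracy(predict, test_demo)
print(score)
```

Because the labels in this dataset are roughly balanced between positive and negative, a score near 0.5 would mean the model is doing no better than guessing.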

In [25]:
nltk.classify.accuracy(classifier, X_test)

0.7205022732193115

## Bonus Question 1: Improve Model Performance

If you are still not exhausted so far and want to dig deeper, try to improve your classifier performance. There are many aspects you can dig into, for example:

* Improve stemming and lemmatization. Inspect your bag of words and the most important features. Are there any words you should remove from the analysis? You can append those words to the stop words list.

* Remember that we only used the top 5,000 features to build the model? Try using different numbers of top features. The goal is to use as few features as you can without compromising your model performance. The fewer features you select for your model, the faster it trains. You can then use a larger sample size to improve your model's accuracy score.

In [None]:
# your code here

## Bonus Question 2: Machine Learning Pipeline

In a new Jupyter Notebook, combine all your code into a function (or a class). Your new function will run the complete machine learning pipeline: it takes the dataset location as input and returns the trained classifier. This will allow you to use your function to predict the sentiment of any tweet in real time.

In [None]:
# your code here

## Bonus Question 3: Apache Spark

If you have completed the Apache Spark advanced topic lab, you can migrate your pipeline from your local machine to a Databricks Notebook. Share your notebook with your instructor and classmates to show off your achievements!

In [None]:
# your code here