# Assignment 5

You have been provided with a pickle file containing 100 news articles about Caterpillar. Identify which companies are mentioned most frequently in the news along with Caterpillar.


- Discard non-English results
- Identify what companies are mentioned most frequently along with Caterpillar (in both title and the body of the article)
- Show a table or chart with your top-20 companies (sorted in descending order)
    - I suggest two separate tables: top-20 based on title and top-20 based on body of the article
- Use a couple of different NER packages and options (e.g. both NLTK and SpaCy, with and without sentence segmentation). This way you can evaluate which model gives you the best results
- Your top-20 list should only be based on your most accurate results from the best performing NER package

In [1]:
###Loading Packages###
import pandas as pd  # data frame operations  
import numpy as np  # arrays and math functions

import re # regular expressions
import os # operating system interface
from datetime import datetime

import nltk
import nltk.corpus  
from nltk.text import Text
import sys

import warnings
warnings.filterwarnings("ignore")

from textblob import TextBlob

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import metrics

import spacy

In [2]:
data = pd.read_pickle('news_cat.pkl')

In [3]:
data

Unnamed: 0,crawled,language,text,title
0,2018-01-30T23:03:51.004+02:00,english,by Abhishek K Global Telehandler Market 2023 D...,Global Telehandler Market 2023 Demand by Segme...
1,2018-01-30T23:06:46.024+02:00,english,favorite this post 2014 Caterpillar 314E LCR h...,2014 Caterpillar 314E LCR
2,2018-01-30T23:18:35.023+02:00,english,By: MAX NISEN The Amazon health care threat ha...,"Amazon, Berkshire, JPMorgan health announcemen..."
3,2018-01-30T23:20:54.012+02:00,english,QR Code Link to This Post MONTHLY PUBLIC AUCTI...,2005 Caterpillar CB534D Tandem Vibratory Rolle...
4,2018-01-30T23:28:30.000+02:00,english,QR Code Link to This Post 2007 CATERPILLAR D4G...,2007 CATERPILLAR D4G LGP CAB SCREEN/SWEEPS - O...
...,...,...,...,...
95,2018-01-31T17:00:58.003+02:00,english,What to Expect From Caterpillar Inc. in 2018 J...,What to Expect From Caterpillar Inc. in 2018
96,2018-01-31T17:06:20.008+02:00,english,transmission: other QR Code Link to This Post ...,Like to trade my Mitsubishi Excavator for a Mo...
97,2018-01-31T17:23:34.002+02:00,english,One year after Caterpillar's headquarters anno...,One year after Caterpillar's headquarters anno...
98,2018-01-31T17:26:56.003+02:00,english,"1,613 Shares in Caterpillar Inc. (NYSE:CAT) Pu...","1,613 Shares in Caterpillar Inc. (NYSE:CAT) Pu..."


## 1. Discard Non-English Results

In [4]:
data_cleaned = data[data['language'] == 'english']

In [5]:
data_cleaned

Unnamed: 0,crawled,language,text,title
0,2018-01-30T23:03:51.004+02:00,english,by Abhishek K Global Telehandler Market 2023 D...,Global Telehandler Market 2023 Demand by Segme...
1,2018-01-30T23:06:46.024+02:00,english,favorite this post 2014 Caterpillar 314E LCR h...,2014 Caterpillar 314E LCR
2,2018-01-30T23:18:35.023+02:00,english,By: MAX NISEN The Amazon health care threat ha...,"Amazon, Berkshire, JPMorgan health announcemen..."
3,2018-01-30T23:20:54.012+02:00,english,QR Code Link to This Post MONTHLY PUBLIC AUCTI...,2005 Caterpillar CB534D Tandem Vibratory Rolle...
4,2018-01-30T23:28:30.000+02:00,english,QR Code Link to This Post 2007 CATERPILLAR D4G...,2007 CATERPILLAR D4G LGP CAB SCREEN/SWEEPS - O...
...,...,...,...,...
95,2018-01-31T17:00:58.003+02:00,english,What to Expect From Caterpillar Inc. in 2018 J...,What to Expect From Caterpillar Inc. in 2018
96,2018-01-31T17:06:20.008+02:00,english,transmission: other QR Code Link to This Post ...,Like to trade my Mitsubishi Excavator for a Mo...
97,2018-01-31T17:23:34.002+02:00,english,One year after Caterpillar's headquarters anno...,One year after Caterpillar's headquarters anno...
98,2018-01-31T17:26:56.003+02:00,english,"1,613 Shares in Caterpillar Inc. (NYSE:CAT) Pu...","1,613 Shares in Caterpillar Inc. (NYSE:CAT) Pu..."


## 2. Identify what companies are mentioned most frequently along with Caterpillar (in both title and the body of the article)

### 2.1 Title - Using SpaCy

In [6]:
# Load SpaCy model
nlp = spacy.load("en_core_web_md")

**Preprocess the titles and extract ORG entities**

In [96]:
result = []
# Iterate over the English-only rows instead of a hard-coded range(0, 100)
for title in data_cleaned['title']:
    # Strip newlines before running the SpaCy pipeline
    doc = nlp(title.replace("\n", ""))
    # Keep only the entities that SpaCy labels as organizations
    result.extend(ent.text for ent in doc.ents if ent.label_ == 'ORG')

In [98]:
# Count how often each ORG entity appears across all titles
fdist_title_spacy = nltk.FreqDist(result)

print(fdist_title_spacy)

# Top 20 most frequent entities
fdist_title_spacy.most_common(20)

<FreqDist with 78 samples and 131 outcomes>


[('Caterpillar Inc.', 15),
 ('CAT', 14),
 ('Caterpillar', 13),
 ('CYCLE', 8),
 ('NYSE', 4),
 ('Ephrata', 3),
 ('Amazon', 2),
 ('CATERPILLAR', 2),
 ('Global Telehandler Market 2023', 1),
 ('Swot Analysis', 1),
 ('Major Customer Survey & Demand Forecast', 1),
 ('Caterpillar 314E', 1),
 ('JPMorgan', 1),
 ('Chicago Business', 1),
 ('Caterpillar CB534D Tandem Vibratory Roller', 1),
 ('Elite Wealth Management Inc. Acquires', 1),
 ('Kayzo, Alesso & More Headline Beyond Wonderland', 1),
 ('Caterpillar T40D LP Forklift Tow', 1),
 ('Claymont', 1),
 ('Favorable News Coverage Somewhat Unlikely to Impact Caterpillar', 1)]

In [99]:
title_list_spacy, title_freq_spacy = zip(*fdist_title_spacy.most_common(20))
title_list_spacy = list(title_list_spacy)
title_freq_spacy = list(title_freq_spacy)

### 2.2 Title - Using NLTK

In [70]:
entities = []
labels = []

for text in data_cleaned['title']:
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with spaces
            labels.append(chunk.label())

entities_labels = list(zip(entities, labels))
#entities_labels = list(set(zip(entities, labels))) #unique entities

In [71]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

Unnamed: 0,Entities,Labels
0,Global,PERSON
1,Telehandler,ORGANIZATION
2,Demand,GPE
3,Segment,ORGANIZATION
4,Swot Analysis,PERSON
...,...,...
224,Summit X,PERSON
225,Caterpillar Inc.,ORGANIZATION
226,CAT,ORGANIZATION
227,Holdings Cut,ORGANIZATION


In [72]:
title_nltk = entities_df[entities_df['Labels'] == 'ORGANIZATION']
title_nltk

Unnamed: 0,Entities,Labels
1,Telehandler,ORGANIZATION
3,Segment,ORGANIZATION
6,Demand Forecast,ORGANIZATION
8,Berkshire,ORGANIZATION
9,JPMorgan,ORGANIZATION
...,...,...
222,Caterpillar Inc.,ORGANIZATION
223,NYSE,ORGANIZATION
225,Caterpillar Inc.,ORGANIZATION
226,CAT,ORGANIZATION


In [73]:
title_nltk = title_nltk.groupby(['Entities'])['Entities'].count().reset_index(name='count').sort_values(['count'], ascending=False)                           

In [74]:
title_nltk.head(20)

Unnamed: 0,Entities,count
9,Caterpillar Inc.,13
4,CAT,12
23,GENERATORS,8
10,DIESEL,8
45,NYSE,4
65,Wonderland,3
17,Ephrata,3
26,Giants Form Health Alliance,2
18,FOR,2
59,Segment,1


### 2.3 Text - NLTK (no sentence segmentation)

In [37]:
entities = []
labels = []

for text in data_cleaned['text']:
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary=False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with spaces
            labels.append(chunk.label())

entities_labels = list(zip(entities, labels))
#entities_labels = list(set(zip(entities, labels))) #unique entities

In [38]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

Unnamed: 0,Entities,Labels
0,Abhishek K Global Telehandler,PERSON
1,Demand,GPE
2,Segment,ORGANIZATION
3,Swot Analysis,PERSON
4,Major Customer,PERSON
...,...,...
4178,NYSE,ORGANIZATION
4179,Receive News,PERSON
4180,Ratings,ORGANIZATION
4181,Caterpillar,ORGANIZATION


In [39]:
text_nltk = entities_df[entities_df['Labels'] == 'ORGANIZATION']

In [40]:
text_nltk = text_nltk.groupby(['Entities'])['Entities'].count().reset_index(name='count').sort_values(['count'], ascending=False)
text_nltk.head(20)

Unnamed: 0,Entities,count
139,Caterpillar,99
144,Caterpillar Inc.,66
449,NYSE,59
93,CAT,43
137,Cat,36
161,Company,23
569,SEC,23
345,JPMorgan,21
632,Transportation,20
671,Vista Partners,20


### 2.4 Text - NLTK (with sentence segmentation)

In [41]:
entities = []
labels = []

for text in data_cleaned['text']:
    # Tag each sentence separately instead of the whole article at once
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False):
            if hasattr(chunk, 'label'):
                entities.append(' '.join(c[0] for c in chunk))  # join multi-token entities with spaces
                labels.append(chunk.label())

entities_labels = list(zip(entities, labels))
#entities_labels = list(set(zip(entities, labels))) #unique entities

In [42]:
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
entities_df

Unnamed: 0,Entities,Labels
0,Abhishek K Global Telehandler,PERSON
1,Demand,GPE
2,Segment,ORGANIZATION
3,Swot Analysis,PERSON
4,Major Customer,PERSON
...,...,...
4456,Receive,PERSON
4457,News,ORGANIZATION
4458,Ratings,ORGANIZATION
4459,Caterpillar,ORGANIZATION


In [43]:
text_nltk2 = entities_df[entities_df['Labels'] == 'ORGANIZATION']

In [44]:
text_nltk2 = text_nltk2.groupby(['Entities'])['Entities'].count().reset_index(name='count').sort_values(['count'], ascending=False)
text_nltk2.head(20)

Unnamed: 0,Entities,count
144,Caterpillar,87
149,Caterpillar Inc.,83
466,NYSE,59
97,CAT,43
142,Cat,36
166,Company,27
596,SEC,23
359,JPMorgan,21
246,Exchange Commission,20
666,Transportation,20


### 2.5 Text - SpaCy (no sentence segmentation)

In [45]:
result = []
# Iterate over the English-only rows instead of a hard-coded range(0, 100)
for text in data_cleaned['text']:
    doc = nlp(text)
    # Keep only the entities that SpaCy labels as organizations
    result.extend(ent.text for ent in doc.ents if ent.label_ == 'ORG')

In [48]:
# Count how often each ORG entity appears across all article bodies
fdist_text_spacy = nltk.FreqDist(result)

print(fdist_text_spacy)

# Top 20 most frequent entities
fdist_text_spacy.most_common(20)

<FreqDist with 868 samples and 2368 outcomes>


[('Caterpillar', 494),
 ('Caterpillar Inc.', 86),
 ('CAT', 82),
 ('NYSE', 59),
 ('Amazon', 24),
 ('SEC', 23),
 ('Company', 23),
 ('Vista', 22),
 ('Resource Industries', 19),
 ('Cat', 19),
 ('Energy & Transportation', 18),
 ('Construction Industries', 17),
 ('Financial Products', 17),
 ('Rolling Stock', 16),
 ('Citigroup', 13),
 ('Caterpillar Inc', 13),
 ('Dow', 13),
 ('The Lincolnian Online', 12),
 ('Apple', 12),
 ('Vista Partners', 12)]

In [49]:
text_list_spacy, text_freq_spacy = zip(*fdist_text_spacy.most_common(20))
text_list_spacy = list(text_list_spacy)
text_freq_spacy = list(text_freq_spacy)

## 3. Compare the results

In [100]:
title_spacy_top20 = pd.DataFrame(list(zip(title_list_spacy, title_freq_spacy)), columns=['SpaCy_Entities', 'SpaCy_Count'])
title_nltk_top20 = title_nltk.head(20).reset_index()
title_nltk_top20 = title_nltk_top20.drop(columns=['index'])

In [101]:
title_result = pd.concat([title_spacy_top20, title_nltk_top20], axis=1)
title_result = title_result.rename(columns={"Entities": "NLTK_Entities", "count": "NLTK_Count"})

In [102]:
title_result

Unnamed: 0,SpaCy_Entities,SpaCy_Count,NLTK_Entities,NLTK_Count
0,Caterpillar Inc.,15,Caterpillar Inc.,13
1,CAT,14,CAT,12
2,Caterpillar,13,GENERATORS,8
3,CYCLE,8,DIESEL,8
4,NYSE,4,NYSE,4
5,Ephrata,3,Wonderland,3
6,Amazon,2,Ephrata,3
7,CATERPILLAR,2,Giants Form Health Alliance,2
8,Global Telehandler Market 2023,1,FOR,2
9,Swot Analysis,1,Segment,1


Neither the SpaCy nor the NLTK results for titles are very good: both contain many words that are not company names. (Since a title is a single sentence, I did not use sentence segmentation for NLTK here.)

Possible reasons:
- A title is often not a complete sentence, which makes POS tagging harder
- Many titles are written entirely in capital letters, which may also strongly affect the results
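
One way to probe the all-caps explanation is to title-case headlines that are written entirely in uppercase before passing them to the NER package. The sketch below is only an assumption about a possible preprocessing step, not something the notebook ran; `normalize_title_case` is a hypothetical helper name, and whether it improves either package's output would need to be checked.

```python
def normalize_title_case(title: str) -> str:
    """Title-case a headline only when all of its letters are uppercase."""
    # str.isupper() is True when every cased character is uppercase
    # and the string contains at least one cased character.
    if title.isupper():
        return title.title()
    # Mixed-case headlines are left untouched.
    return title


# Example headlines: one all-caps listing, one regular news title.
titles = [
    "2007 CATERPILLAR D4G LGP CAB",
    "What to Expect From Caterpillar Inc. in 2018",
]
normalized = [normalize_title_case(t) for t in titles]
print(normalized)
```

Note the trade-off: title-casing also rewrites acronyms (for example `LGP` becomes `Lgp`), so tickers such as `CAT` in an all-caps headline would no longer look like tickers.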


In [87]:
text_spacy_top20 = pd.DataFrame(list(zip(text_list_spacy, text_freq_spacy)), 
                                columns=['SpaCy_Entities', 'SpaCy_Count'])

text_nltk_top20 = text_nltk.head(20).reset_index()
text_nltk_top20 = text_nltk_top20.drop(columns=['index'])
text_nltk_top20 = text_nltk_top20.rename(columns={"Entities": "NLTK_Entities", "count": "NLTK_Count"})
text_nltk_seg_top20 = text_nltk2.head(20).reset_index()
text_nltk_seg_top20 = text_nltk_seg_top20.drop(columns=['index'])
text_nltk_seg_top20 = text_nltk_seg_top20.rename(columns={"Entities": "NLTK_SEG_Entities", "count": "NLTK_SEG_Count"})

In [90]:
text_result = pd.concat([text_spacy_top20, text_nltk_top20, text_nltk_seg_top20], axis=1)
text_result

Unnamed: 0,SpaCy_Entities,SpaCy_Count,NLTK_Entities,NLTK_Count,NLTK_SEG_Entities,NLTK_SEG_Count
0,Caterpillar,494,Caterpillar,99,Caterpillar,87
1,Caterpillar Inc.,86,Caterpillar Inc.,66,Caterpillar Inc.,83
2,CAT,82,NYSE,59,NYSE,59
3,NYSE,59,CAT,43,CAT,43
4,Amazon,24,Cat,36,Cat,36
5,SEC,23,Company,23,Company,27
6,Company,23,SEC,23,SEC,23
7,Vista,22,JPMorgan,21,JPMorgan,21
8,Resource Industries,19,Transportation,20,Exchange Commission,20
9,Cat,19,Vista Partners,20,Transportation,20
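
The comparison of the columns above is done by eye. A less subjective option would be to hand-label which entities are real companies and compute the precision of each top-k list. The sketch below assumes such a labeled set exists; `precision_at_k`, the `gold` set, and the `predicted` list are hypothetical and purely illustrative, not results from this notebook.

```python
def precision_at_k(predicted, gold_companies, k=20):
    """Return the fraction of the first k predicted entities found in the gold set."""
    top_k = predicted[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for name in top_k if name in gold_companies)
    return hits / len(top_k)


# Hypothetical hand-labeled set of true company names (illustration only).
gold = {"Caterpillar Inc.", "Amazon"}

# Hypothetical ranked output of an NER package (illustration only).
predicted = ["Caterpillar Inc.", "NYSE", "Amazon", "Company"]

print(precision_at_k(predicted, gold, k=4))
```

With a real labeled set, the same function could be applied to each entity column, for example `precision_at_k(list(text_result['SpaCy_Entities']), gold)`.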


### Conclusion:

Based on the tables above, the companies mentioned most frequently along with Caterpillar are:
- CYCLE, DIESEL, NYSE, Amazon, JPMorgan, Mitsubishi (for titles)
- NYSE, SEC, Vista, JPMorgan, Apple, Lincolnian Online (for text)

Based on the results above for both titles and text, I would choose **SpaCy** as the best-performing NER package.

Even though the results for titles are not very reasonable, the SpaCy results for text consist mostly of plausible company names, while the NLTK results, both with and without sentence segmentation, still contain some entities that do not make sense (e.g. NOT, LLC). Also, NLTK spreads its counts across different representations of the same company, while SpaCy gives a more concentrated result (e.g. Caterpillar).
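
Because the same company still appears under several surface forms, a final top-20 list could merge those forms and drop Caterpillar itself before ranking. The sketch below is a minimal illustration using counts copied from the SpaCy text output above; the `ALIASES` map and the `merge_aliases` helper are assumptions (in particular, treating "Vista" as "Vista Partners" is a manual judgment, not something either package reports).

```python
from collections import Counter

# Assumed alias map: deciding which surface forms refer to the same
# company is a manual judgment made here for illustration.
ALIASES = {
    "Caterpillar Inc.": "Caterpillar",
    "Caterpillar Inc": "Caterpillar",
    "CAT": "Caterpillar",
    "Cat": "Caterpillar",
    "Vista": "Vista Partners",
}


def merge_aliases(counts, aliases, exclude=()):
    """Sum the counts of aliased names under one canonical name, skipping `exclude`."""
    merged = Counter()
    for name, count in counts.items():
        canonical = aliases.get(name, name)
        if canonical not in exclude:
            merged[canonical] += count
    return merged


# Counts copied from the SpaCy text output above.
spacy_counts = {
    "Caterpillar": 494, "Caterpillar Inc.": 86, "CAT": 82, "Cat": 19,
    "Caterpillar Inc": 13, "Amazon": 24, "Vista": 22, "Vista Partners": 12,
}

# Companies mentioned along with Caterpillar, with Caterpillar itself removed.
co_mentions = merge_aliases(spacy_counts, ALIASES, exclude={"Caterpillar"})
print(co_mentions.most_common())
```

The same helper could take `dict(fdist_text_spacy)` as input, since a `FreqDist` behaves like a dictionary of counts.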