## Machine Learning Pipeline
1. Load data
    1. Use train/test split
2. Engineer features (vectorise the text with TF-IDF)
3. Run a Logistic Regression
    1. Extract classification_report
    2. Tune model to meet expectations for accuracy, precision, recall
    3. Extract correctly classified results
4. Run NLP on correctly classified results
5. Extract `ORG` (companies) and `PERSON` (CEOs) entities and save them to file.
6. Crack a Corona.
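The first three steps can be sketched end to end on toy data. This is a minimal illustration, not the notebook's actual run: the sentences and labels are invented, and the split size is chosen only so that the tiny sample works.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Invented toy sentences; label 1 means the sentence mentions a company.
texts = [
    "Apple reported record quarterly revenue",
    "Shares of Microsoft rose after earnings",
    "Google announced a new product line",
    "Amazon expands its logistics network",
    "The weather was mild this morning",
    "He walked to the park after lunch",
    "Rain is expected later this week",
    "They cooked dinner and watched a film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 1: hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# Step 2: fit TF-IDF on the training text only, then transform both splits.
vectoriser = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X_train_vec = vectoriser.fit_transform(X_train)
X_test_vec = vectoriser.transform(X_test)

# Step 3: fit the classifier and report per-class metrics.
model = LogisticRegression(max_iter=5000, class_weight="balanced")
model.fit(X_train_vec, y_train)
y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred, zero_division=0))
```

Fitting the vectoriser on the training split alone keeps test-set vocabulary out of the features, which is the same order of operations the cells below follow.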

In [1]:
import pandas as pd
ceo_df = pd.read_csv('ceo_df.csv')
company_df = pd.read_csv('company_df.csv')

### Extracting Companies

In [2]:
from sklearn.model_selection import train_test_split

X_comp = company_df.drop('company_label',axis=1)
y_comp = company_df.company_label

Xcomp_train, Xcomp_test, ycomp_train, ycomp_test = train_test_split(X_comp, y_comp, test_size=0.4, random_state=1)

### Vectorise with TF-IDF

The hand-picked and computed features performed poorly on their own, so I will use TF-IDF to generate up to 10,000 additional n-gram features (unigrams to trigrams) from the text. God bless increases in computing power.
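As a small illustration of how `ngram_range` and `max_features` interact, the sketch below fits two vectorisers on three invented sentences. The documents and the cap of 5 are assumptions chosen to keep the output readable; the notebook itself uses a cap of 10,000.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example documents.
docs = [
    "central bank raises interest rates",
    "central bank holds interest rates",
    "bank shares rise on rate decision",
]

# Without a cap, every 1-, 2- and 3-gram seen in the documents becomes a column.
full = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)

# With max_features, only the most frequent n-grams across the corpus are kept.
capped = TfidfVectorizer(ngram_range=(1, 3), max_features=5).fit(docs)

print(len(full.vocabulary_))
print(len(capped.vocabulary_))
print(capped.transform(docs).shape)
```

The cap bounds the width of the feature matrix regardless of how large the corpus vocabulary grows, which is what keeps the later model fit tractable.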

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [4]:
%%time
tfidf_company = TfidfVectorizer(ngram_range=(1,3),stop_words='english',
                           max_features=10000).fit(Xcomp_train.text_organisations)

CPU times: user 45.5 s, sys: 1.85 s, total: 47.3 s
Wall time: 54 s


In [5]:
%%time
company_text_train = tfidf_company.transform(Xcomp_train.text_organisations)
company_text_test = tfidf_company.transform(Xcomp_test.text_organisations)

CPU times: user 23.4 s, sys: 284 ms, total: 23.7 s
Wall time: 24.1 s


In [6]:
%%time
Xcomp_train = hstack([Xcomp_train.drop("text_organisations", axis=1), company_text_train])
Xcomp_test = hstack([Xcomp_test.drop("text_organisations",axis=1), company_text_test])

CPU times: user 175 ms, sys: 65.9 ms, total: 241 ms
Wall time: 241 ms


In [7]:
Xcomp_train.shape

(403045, 10003)

In [8]:
Xcomp_test.shape

(268698, 10003)
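The width of 10,003 is consistent with three numeric columns placed next to 10,000 TF-IDF columns. A minimal sketch of that stacking, using small invented matrices in place of the real ones:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins: 3 rows, 2 numeric columns and 4 TF-IDF columns.
numeric = csr_matrix(np.array([[10, 6], [2, 1], [12, 5]], dtype=float))
tfidf = csr_matrix(np.array([
    [0.0, 0.7, 0.0, 0.7],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
]))

# hstack joins the blocks column-wise; tocsr() gives a format that supports row indexing.
combined = hstack([numeric, tfidf]).tocsr()
print(combined.shape)
```

The numeric columns keep their raw scale while the TF-IDF columns lie between 0 and 1, so scaling the numeric block first may be worth trying if the model is slow to converge.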

In [9]:
%%time
lr_org = LogisticRegression(verbose=1,max_iter=5000, class_weight='balanced')
lr_org.fit(Xcomp_train, ycomp_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


CPU times: user 8min 17s, sys: 1min 59s, total: 10min 16s
Wall time: 1min 32s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.5min finished


In [10]:
ycomp_train_pred = lr_org.predict(Xcomp_train)
print(classification_report(ycomp_train, ycomp_train_pred))

              precision    recall  f1-score   support

           0       0.97      0.96      0.97    323598
           1       0.84      0.90      0.87     79447

    accuracy                           0.95    403045
   macro avg       0.91      0.93      0.92    403045
weighted avg       0.95      0.95      0.95    403045



In [11]:
ycomp_test_pred = lr_org.predict(Xcomp_test)
print(classification_report(ycomp_test, ycomp_test_pred))

              precision    recall  f1-score   support

           0       0.97      0.96      0.96    215769
           1       0.83      0.89      0.86     52929

    accuracy                           0.94    268698
   macro avg       0.90      0.92      0.91    268698
weighted avg       0.94      0.94      0.94    268698
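To make the report easier to read, the sketch below recomputes two of its figures from the rounded per-class values printed above. Because the inputs are already rounded to two decimals, recomputed figures can differ from the report in the last digit; the helper name is my own.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the test-set report for the company model.
precision_0, recall_0 = 0.97, 0.96
precision_1, recall_1 = 0.83, 0.89

f1_positive = f1_from(precision_1, recall_1)

# The macro average weights both classes equally, ignoring class size.
macro_precision = (precision_0 + precision_1) / 2

print(round(f1_positive, 2))
print(round(macro_precision, 2))
```

The gap between the macro average and the weighted average reflects the class imbalance: class 0 has roughly four times the support of class 1.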



In [12]:
print('ycomp_test_pred len is {}'.format(len(ycomp_test_pred)))
print('ycomp_train_pred len is {}'.format(len(ycomp_train_pred)))

ycomp_test_pred len is 268698
ycomp_train_pred len is 403045


In [13]:
classifiedtrain_companies = ycomp_train[(ycomp_train == ycomp_train_pred) & (ycomp_train == 1)]
classifiedtest_companies = ycomp_test[(ycomp_test == ycomp_test_pred) & (ycomp_test == 1)]

In [14]:
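# Caution: ignore_index=True replaces the original row labels with 0..n-1,
# so the index-based merge in the next cell lines up with the first n rows
# of company_df rather than with the correctly classified rows.
try_one = pd.concat([classifiedtrain_companies, classifiedtest_companies], ignore_index=True)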

In [15]:
companies_returned = pd.merge(company_df, try_one, left_index=True, right_index=True)

In [16]:
companies_returned.head()

| | text_organisations | num_words | num_capitals | company_nearby | company_label_x | company_label_y |
|---|---|---|---|---|---|---|
| 0 | Earlier today we had a strong South Korean PMI... | 10 | 6 | 0 | 0 | 1 |
| 1 | The latest? | 2 | 1 | 0 | 0 | 1 |
| 2 | It just saw a rise in December PMI from 47.4 t... | 12 | 5 | 0 | 0 | 1 |
| 3 | From the report: With the House prepared to v... | 40 | 19 | 1 | 0 | 1 |
| 4 | Here's the banner leading Drudge Report right ... | 16 | 6 | 0 | 0 | 1 |
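The preview above shows `company_label_x` equal to 0 on rows where `company_label_y` is 1, which suggests the merge is not returning the correctly classified rows. A likely cause is that `ignore_index=True` discards the original row labels before the index-based merge. The sketch below reproduces the effect on an invented frame and shows one index-preserving alternative; the column names and row choices are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for company_df (values invented for illustration).
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": [0, 1, 0, 1, 1, 0],
})

# Pretend rows 4 and 1 were correct in train and row 3 was correct in test.
correct_train = df.loc[[4, 1], "label"]
correct_test = df.loc[[3], "label"]

# ignore_index=True relabels the rows 0, 1, 2, so an index merge picks
# the first three rows of df, whatever their true label.
reset = pd.concat([correct_train, correct_test], ignore_index=True)
merged_reset = pd.merge(df, reset, left_index=True, right_index=True)

# Keeping the original labels lets .loc recover the intended rows.
kept = pd.concat([correct_train, correct_test])
selected = df.loc[kept.index]

print(merged_reset[["text", "label_x", "label_y"]])
print(selected)
```

With the index preserved, every selected row has a true label of 1, which is the property the later entity extraction relies on.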


In [17]:
i = companies_returned['text_organisations']
len(i)

118025

In [18]:
def matching_engine_3000(values,match):
    '''
    values is the nlp processed docs
    match is the entity type to match e.g. PERSON
    '''
    engine_results = [ent.text for ent in values.ents if ent.label_ == match]
    return engine_results
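A quick way to check the function without loading a spaCy model is to pass it a stand-in object. The `fake_doc` below is hypothetical and mimics only the three attributes the function touches: `.ents`, `.text` and `.label_`.

```python
from types import SimpleNamespace

def matching_engine_3000(values, match):
    """Return the text of every entity in `values` whose label equals `match`."""
    return [ent.text for ent in values.ents if ent.label_ == match]

# Hypothetical stand-in for a processed spaCy document.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="HSBC", label_="ORG"),
    SimpleNamespace(text="Stephen Bird", label_="PERSON"),
    SimpleNamespace(text="South Korea", label_="GPE"),
])

orgs = matching_engine_3000(fake_doc, "ORG")
people = matching_engine_3000(fake_doc, "PERSON")
print(orgs, people)
```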

In [19]:
from tqdm import tqdm
import spacy

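# Only the named-entity recogniser is needed, so the tagger and parser are disabled.
# This is spaCy v2 syntax; under spaCy v3 the second line becomes nlp.add_pipe('sentencizer').
nlp = spacy.load("en_core_web_sm", disable=['tagger','parser'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))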

# Run every returned sentence through the spaCy pipeline.
iq=[]
for s in tqdm(i):
    iq.append(nlp(s))

100%|██████████| 118025/118025 [08:28<00:00, 232.11it/s]


In [20]:
%%time
final_companies=[]

# Collect every ORG entity from each processed document.
for doc in tqdm(iq):
    final_companies.extend(matching_engine_3000(doc,"ORG"))

100%|██████████| 118025/118025 [00:00<00:00, 134218.32it/s]

CPU times: user 834 ms, sys: 227 ms, total: 1.06 s
Wall time: 882 ms





In [21]:
print(len(final_companies))
print(len(set(final_companies)))

final_companies = list(set(final_companies))

55737
16544
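Deduplicating with `set` is fine for counting, but the iteration order of a set of strings can change between Python runs, so the saved file may come out in a different order each time. A minimal sketch of two reproducible alternatives, using invented names:

```python
# Invented entity strings with repeats.
names = ["HSBC", "Markit", "HSBC", "Drudge Report", "Markit"]

# dict.fromkeys keeps the first occurrence of each name in its original position.
unique_in_order = list(dict.fromkeys(names))

# Sorting the set gives an alphabetical, run-independent order instead.
unique_sorted = sorted(set(names))

print(unique_in_order)
print(unique_sorted)
```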


In [22]:
final_companies_list = pd.DataFrame(final_companies)
final_companies_list.to_csv('final_companies_list.csv',index=False)

In [23]:
ceo_df.head()

| | text_persons | num_capitals | num_words | ceo_label |
|---|---|---|---|---|
| 0 | From the report: With the House prepared to v... | 19 | 40 | 1 |
| 1 | Here's the banner leading Drudge Report right ... | 6 | 16 | 0 |
| 2 | South Korea -- whose heavy reliance on global ... | 5 | 29 | 0 |
| 3 | From the report: The HSBC South Korea Purchasi... | 17 | 31 | 0 |
| 4 | Markit UPDATE: As the Fiscal Cliff bill gets c... | 11 | 31 | 0 |


### Extracting CEOs

In [24]:
from sklearn.model_selection import train_test_split
X = ceo_df.drop('ceo_label',axis=1)
y = ceo_df.ceo_label

Xceo_train, Xceo_test, yceo_train, yceo_test = train_test_split(X, y, test_size=0.4, random_state=1)

In [25]:
%%time
tfidf_ceo = TfidfVectorizer(ngram_range=(1,3),stop_words='english',
                           max_features=10000).fit(Xceo_train.text_persons)

CPU times: user 23.5 s, sys: 895 ms, total: 24.4 s
Wall time: 24.7 s


In [26]:
%%time
ceo_text_train = tfidf_ceo.transform(Xceo_train.text_persons)
ceo_text_test = tfidf_ceo.transform(Xceo_test.text_persons)

CPU times: user 12.2 s, sys: 101 ms, total: 12.3 s
Wall time: 12.3 s


In [27]:
%%time
Xceo_train = hstack([Xceo_train.drop("text_persons", axis=1), ceo_text_train])
Xceo_test = hstack([Xceo_test.drop("text_persons",axis=1), ceo_text_test])

CPU times: user 88.1 ms, sys: 42.6 ms, total: 131 ms
Wall time: 129 ms


In [28]:
Xceo_train.shape

(209720, 10002)

In [29]:
Xceo_test.shape

(139814, 10002)

In [30]:
%%time
lr_ceo = LogisticRegression(verbose=1,max_iter=5000, class_weight='balanced')
lr_ceo.fit(Xceo_train, yceo_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


CPU times: user 3min 19s, sys: 33.2 s, total: 3min 52s
Wall time: 33.6 s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   33.5s finished


In [31]:
yceo_train_pred = lr_ceo.predict(Xceo_train)
print(classification_report(yceo_train, yceo_train_pred))

              precision    recall  f1-score   support

           0       0.99      0.95      0.97    192915
           1       0.62      0.91      0.74     16805

    accuracy                           0.95    209720
   macro avg       0.81      0.93      0.86    209720
weighted avg       0.96      0.95      0.95    209720



In [32]:
yceo_test_pred = lr_ceo.predict(Xceo_test)
print(classification_report(yceo_test, yceo_test_pred))

              precision    recall  f1-score   support

           0       0.99      0.95      0.97    128774
           1       0.59      0.86      0.70     11040

    accuracy                           0.94    139814
   macro avg       0.79      0.90      0.83    139814
weighted avg       0.96      0.94      0.95    139814



In [33]:
print('yceo_test_pred len is {}'.format(len(yceo_test_pred)))
print('yceo_train_pred len is {}'.format(len(yceo_train_pred)))

yceo_test_pred len is 139814
yceo_train_pred len is 209720


In [34]:
classifiedtrain_ceo = yceo_train[(yceo_train == yceo_train_pred) & (yceo_train == 1)]
classifiedtest_ceo = yceo_test[(yceo_test == yceo_test_pred) & (yceo_test == 1)]

In [35]:
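# Caution: ignore_index=True replaces the original row labels with 0..n-1,
# so the index-based merge in the next cell lines up with the first n rows
# of ceo_df rather than with the correctly classified rows.
try_two = pd.concat([classifiedtrain_ceo, classifiedtest_ceo], ignore_index=True)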

In [36]:
ceos_returned = pd.merge(ceo_df, try_two, left_index=True, right_index=True)

In [37]:
ceos_returned.head()

| | text_persons | num_capitals | num_words | ceo_label_x | ceo_label_y |
|---|---|---|---|---|---|
| 0 | From the report: With the House prepared to v... | 19 | 40 | 1 | 1 |
| 1 | Here's the banner leading Drudge Report right ... | 6 | 16 | 0 | 1 |
| 2 | South Korea -- whose heavy reliance on global ... | 5 | 29 | 0 | 1 |
| 3 | From the report: The HSBC South Korea Purchasi... | 17 | 31 | 0 | 1 |
| 4 | Markit UPDATE: As the Fiscal Cliff bill gets c... | 11 | 31 | 0 | 1 |


In [38]:
i = ceos_returned['text_persons']
len(i)

24769

In [39]:
# Run every returned sentence through the spaCy pipeline.
iq=[]
for s in tqdm(i):
    iq.append(nlp(s))

100%|██████████| 24769/24769 [01:50<00:00, 224.47it/s]


In [40]:
final_ceos=[]

# Collect every PERSON entity from each processed document.
for doc in tqdm(iq):
    final_ceos.extend(matching_engine_3000(doc,"PERSON"))

100%|██████████| 24769/24769 [00:00<00:00, 146454.56it/s]


In [41]:
print(len(final_ceos))
print(len(set(final_ceos)))

final_ceos = list(set(final_ceos))

13608
5919


In [42]:
final_ceos_list = pd.DataFrame(final_ceos)
final_ceos_list.to_csv('final_ceos_list.csv',index=False)

In [43]:
final_ceos_list

| | 0 |
|---|---|
| 0 | Mike Derchin |
| 1 | Charlie Devereuz |
| 2 | Davos Citi |
| 3 | Mohamed Morsy |
| 4 | Stephen Bird |
| ... | ... |
| 5914 | Matt Tucker |
| 5915 | Li Keqiang |
| 5916 | JimBianco |
| 5917 | Wynn |
