## Introduction:

In this project, we use the Stanford NER model to check whether each name in the GKG dataframe is an organization. The steps below show a basic setup; feel free to do it your own way.

## Basic info and requirements:

* The output file should be a __.csv__ file.
* The output dataframe should contain 2 columns, __original_name__ and __cleaned_name__, both taken from the input file. We decide which rows to keep by creating a new column called __Decision__, and we only care about rows where Decision == __Yes__. See the code that follows.
* Each input file contains 500k rows. Processing speed depends on the hardware of your virtual machine, so don't run too many rows at once (for example, take 1000 or 2000 rows at a time and generate the decisions for them).
* Save your result often, since the virtual machine might lose its connection (you will definitely see this, so always save your result after each run).
* This Jupyter notebook is just a template that gives you a basic idea of how to use these packages; you can write your own loop to generate the results.
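The "small batches, save after each run" advice above can be sketched as a loop that appends to the output file after every chunk, so a dropped connection loses at most one chunk. This is only a minimal sketch using the standard library: `process_in_chunks`, its parameters, and the output file name in the usage note are hypothetical names, and `decide` stands for any function that returns "Yes" or "No" for a cleaned name.

```python
import csv
import os

def process_in_chunks(rows, decide, out_path, chunk_size=1000, start=0):
    """Process rows[start:] chunk by chunk, appending the "Yes" rows to out_path.

    rows   : list of (original_name, cleaned_name) pairs
    decide : callable returning "Yes" or "No" for a cleaned name
    start  : row position to resume from after an interrupted run
    Returns the number of rows kept during this call.
    """
    total_kept = 0
    for chunk_start in range(start, len(rows), chunk_size):
        chunk = rows[chunk_start:chunk_start + chunk_size]
        kept = [(orig, clean) for orig, clean in chunk if decide(clean) == "Yes"]

        # Write the header only when the file is new or still empty.
        write_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
        with open(out_path, "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            if write_header:
                writer.writerow(["original_name", "cleaned_name"])
            writer.writerows(kept)

        total_kept += len(kept)
        # Report the last finished position so the run can be resumed from there.
        print(f"rows {chunk_start} to {chunk_start + len(chunk) - 1} saved, {len(kept)} kept")
    return total_kept
```

With a dataframe loaded as in this notebook, the pairs could be built with `list(new_df[['original_name', 'cleaned_name']].itertuples(index=False, name=None))` and passed in together with the `ner_result` function from this notebook and an output path of your choice, for example `'df_101_result.csv'`. If a run is interrupted, pass the next unsaved row position as `start`.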

In [1]:
# packages
import pandas as pd
import numpy as np
import boto
import boto3
import nltk
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

In [2]:
# load the Stanford NER Java model that we downloaded
# check the working directory first (run pwd in a terminal; the paths below assume your folder is named GKGPreprocessing)
st = StanfordNERTagger('/home/ec2-user/GKGPreprocessing/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/home/ec2-user/GKGPreprocessing/stanford-ner-2018-10-16/stanford-ner-3.9.2.jar',
                       encoding='utf-8')

In [3]:
# you only need to download this once.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# S3 path where the input dataframes are stored
file_path = 's3://gkgpreprocessing/'

In [5]:
# set up s3 client and get files inside bucket
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('gkgpreprocessing')
file_name = []
for obj in my_bucket.objects.all():
    file_name.append(obj.key)

In [6]:
# check files
file_name

['df2015TO2017_result.csv',
 'df_1.csv',
 'df_10.csv',
 'df_100.csv',
 'df_101.csv',
 'df_102.csv',
 'df_103.csv',
 'df_104.csv',
 'df_105.csv',
 'df_106.csv',
 'df_107.csv',
 'df_108.csv',
 'df_109.csv',
 'df_11.csv',
 'df_110.csv',
 'df_111.csv',
 'df_112.csv',
 'df_113.csv',
 'df_114.csv',
 'df_115.csv',
 'df_116.csv',
 'df_117.csv',
 'df_118.csv',
 'df_119.csv',
 'df_12.csv',
 'df_120.csv',
 'df_121.csv',
 'df_122.csv',
 'df_123.csv',
 'df_124.csv',
 'df_125.csv',
 'df_126.csv',
 'df_127.csv',
 'df_128.csv',
 'df_129.csv',
 'df_13.csv',
 'df_130.csv',
 'df_131.csv',
 'df_132.csv',
 'df_133.csv',
 'df_134.csv',
 'df_135.csv',
 'df_136.csv',
 'df_137.csv',
 'df_138.csv',
 'df_139.csv',
 'df_14.csv',
 'df_140.csv',
 'df_141.csv',
 'df_15.csv',
 'df_16.csv',
 'df_17.csv',
 'df_18.csv',
 'df_19.csv',
 'df_2.csv',
 'df_20.csv',
 'df_21.csv',
 'df_22.csv',
 'df_23.csv',
 'df_24.csv',
 'df_25.csv',
 'df_26.csv',
 'df_27.csv',
 'df_28.csv',
 'df_29.csv',
 'df_3.csv',
 'df_30.csv',
 'df_31.c

In [7]:
# The first file is the one that contains all the info, so we don't need to use it.
file_name = file_name[1:]
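Dropping the first element works only as long as the combined file happens to be listed first. A slightly more defensive sketch is to keep the keys that match the `df_<number>.csv` pattern and sort them by number. The helper name `chunk_files` is hypothetical, and the sample keys are copied from the listing above.

```python
import re

def chunk_files(keys):
    """Return the keys named df_<number>.csv, sorted by that number."""
    pattern = re.compile(r"df_(\d+)\.csv")
    matched = []
    for key in keys:
        m = pattern.fullmatch(key)
        if m:
            matched.append((int(m.group(1)), key))
    return [key for _, key in sorted(matched)]

keys = ['df2015TO2017_result.csv', 'df_1.csv', 'df_10.csv', 'df_100.csv', 'df_2.csv']
print(chunk_files(keys))
# ['df_1.csv', 'df_2.csv', 'df_10.csv', 'df_100.csv']
```

Sorting by number also makes it easier to work through the files in order and keep track of which ones are finished.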

In [8]:
# We run a test on df_101.csv
new_df = pd.read_csv(file_path + 'df_101.csv',index_col='Unnamed: 0')

In [9]:
new_df.head(10)

Unnamed: 0,original_name,cleaned_name
0,series at chipstead sailing clubamerican civil...,series at chipstead sailing clubamerican civil...
1,serious collision investigation teamap member ...,serious collision investigation teamap member ...
2,serious crime divisioncamp verde marshal office,serious crime divisioncamp verde marshal office
3,serious crime divisiondepartment of homeland s...,serious crime divisiondepartment of homeland s...
4,serious crime divisioninternational refugee as...,serious crime divisioninternational refugee as...
5,serpay folklore group,serpay folklore group
6,serta motion essentials adjustable,serta motion essentials adjustable foundation
7,service above self registration,service above self registration
8,service agent kerry ogrady facing discipline d...,service agent kerry ogrady facing discipline d...
9,service agreement timatinas biopharma holdings,service agreement timatinas biopharma holdings...


In [10]:
len(new_df)

500000

## File descriptions

Because of memory limits, pandas cannot hold the whole dataset as a single dataframe. That is why it is split into 141 dataframes, each containing 500k rows.

I will run an example that takes only 100 rows.
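If even one 500k-row file is more than you want in memory, one option is to read only the rows you are about to process. The following is a minimal sketch, assuming the files have the layout shown above (a header line, then an unnamed index column followed by `original_name` and `cleaned_name`). The helper name `read_rows` is hypothetical, and the in-memory text is just a stand-in for a real file.

```python
import io
import pandas as pd

def read_rows(source, start, count):
    """Read `count` data rows, beginning at data row `start` (0-based)."""
    return pd.read_csv(
        source,
        index_col=0,                   # the saved index column ('Unnamed: 0' in these files)
        skiprows=range(1, start + 1),  # line 0 is the header, so data row i is line i + 1
        nrows=count,
    )

# Tiny in-memory stand-in for one of the df_N.csv files.
demo = io.StringIO(
    ",original_name,cleaned_name\n"
    "0,a,a\n1,b,b\n2,c,c\n3,d,d\n"
)
print(read_rows(demo, start=1, count=2))
```

For the 100-row sample used below, the equivalent call would be `read_rows(file_path + 'df_101.csv', start=100, count=100)`. The skipped lines still have to be scanned, but they are never kept in memory.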

In [11]:
# take the rows labeled 100 to 199 (inclusive) as a sample dataframe
sample_df = new_df.loc[100:199]

In [12]:
# function that decides whether the tagged result contains an organization
def getDecision(input_list):
    # input_list is a list of (token, label) pairs returned by the tagger
    for item in input_list:
        if item[1] == 'ORGANIZATION':
            return "Yes"

    return "No"
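To make the input of this function concrete: the tagger returns a list of `(token, label)` pairs, and the 3-class model labels each token `PERSON`, `ORGANIZATION`, `LOCATION`, or `O` (none of these). The pairs below are hand-written for illustration, not real tagger output, and `has_organization` is a hypothetical name for an equivalent one-line version of the check.

```python
def has_organization(tagged):
    """Return "Yes" if any (token, label) pair is labeled ORGANIZATION."""
    return "Yes" if any(label == "ORGANIZATION" for _, label in tagged) else "No"

# Hand-written examples of what tagged input looks like (illustrative labels only).
with_org = [("Central", "ORGANIZATION"), ("Intelligence", "ORGANIZATION"), ("Agency", "ORGANIZATION")]
without_org = [("Services", "O"), ("Login", "O")]

print(has_organization(with_org))     # Yes
print(has_organization(without_org))  # No
print(has_organization([]))           # No
```

A single `ORGANIZATION` token is enough for a "Yes", so a long concatenated name can pass even when most of it is not an organization.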

In [13]:
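def ner_result(input_str):
    # capitalize each word (the input names are all lowercase)
    new_str = ' '.join([w.capitalize() for w in input_str.split(' ')])
    tokenized_text = word_tokenize(new_str)
    # tag each token, then check whether any token is an ORGANIZATION
    classified_text = st.tag(tokenized_text)
    return getDecision(classified_text)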

In [14]:
import datetime
# Here I'm using an m5.12xlarge virtual machine, and I use datetime to show the approximate running time.
# You can also add a few lines to your code to show a countdown.

In [15]:
result_df = sample_df.copy()
initial_time = datetime.datetime.now()
result_df['Decision'] = result_df['cleaned_name'].apply(ner_result)
print((datetime.datetime.now() - initial_time).total_seconds())

162.047487
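At this rate (about 1.6 seconds per row), one 500k-row file would take roughly 810,000 seconds, or about 225 hours. As far as I know, much of that time comes from NLTK's wrapper starting a separate Java process for every `st.tag` call. The same wrapper also offers `tag_sents`, which tags a list of token lists in one call, so batching may be much faster. The sketch below is untimed and makes some assumptions: `decide_batch` is a hypothetical helper, and `FakeTagger` is only a stand-in so the example runs without Java.

```python
def decide_batch(names, tagger, tokenize=str.split):
    """Return one "Yes"/"No" decision per name, using a single tag_sents call."""
    # Same preprocessing as ner_result: capitalize each word, then tokenize.
    token_lists = [
        tokenize(" ".join(w.capitalize() for w in name.split(" ")))
        for name in names
    ]
    tagged = tagger.tag_sents(token_lists)  # one call for the whole batch
    return [
        "Yes" if any(label == "ORGANIZATION" for _, label in sent) else "No"
        for sent in tagged
    ]

class FakeTagger:
    """Stand-in for demonstration only; NOT the Stanford model."""
    def tag_sents(self, sentences):
        return [
            [(tok, "ORGANIZATION" if tok == "Agency" else "O") for tok in sent]
            for sent in sentences
        ]

print(decide_batch(["central intelligence agency", "services login"], FakeTagger()))
# ['Yes', 'No']
```

With the real tagger, the call would look like `result_df['Decision'] = decide_batch(result_df['cleaned_name'].tolist(), st, tokenize=word_tokenize)`. It is worth comparing the decisions against `ner_result` on a small sample before relying on it.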


In [16]:
result_df.head(10)

Unnamed: 0,original_name,cleaned_name,Decision
100,services indicestwitter,services indicestwitter,No
101,services league ivan venningfinancial modellin...,services league ivan venningfinancial modellin...,Yes
102,services league ivan venninggeneva refugee con...,services league ivan venninggeneva refugee con...,No
103,services league ivan venningomnia ibrahim egyp...,services league ivan venningomnia ibrahim egyp...,No
104,services login,services login,No
105,services mainbehavior blocker,services mainbehavior blocker,No
106,services maincentral intelligence agency,services maincentral intelligence agency,Yes
107,services maindistrict reserve groupabu dhabi n...,services maindistrict reserve groupabu dhabi n...,Yes
108,services mainsociety toreuters,services mainsociety toreuters,No
109,services mainunited statesqueen collegedepartm...,services mainunited statesqueen collegedepartm...,Yes


In [18]:
output_df = result_df[result_df.Decision == 'Yes']

In [19]:
output_df.head(10)

Unnamed: 0,original_name,cleaned_name,Decision
101,services league ivan venningfinancial modellin...,services league ivan venningfinancial modellin...,Yes
106,services maincentral intelligence agency,services maincentral intelligence agency,Yes
107,services maindistrict reserve groupabu dhabi n...,services maindistrict reserve groupabu dhabi n...,Yes
109,services mainunited statesqueen collegedepartm...,services mainunited statesqueen collegedepartm...,Yes
111,services public relationsdisa courtschmidt nat...,services public relationsdisa courtschmidt nat...,Yes
112,services public relationsmayo performing arts ...,services public relationsmayo performing arts ...,Yes
120,services taxforeign ministry for american affa...,services taxforeign ministry for american affa...,Yes
124,services taxjames walker group,services taxjames walker group,Yes
127,services taxstag companyassetmark,services taxstag companyassetmark inc,Yes
133,services viagraafghan for strategic,services viagraafghan institute for strategic,Yes


## Output only the first two columns, original_name and cleaned_name

In [20]:
output_df = output_df[['original_name','cleaned_name']]

In [21]:
output_df.head(10)
# REMEMBER TO SAVE FILE

Unnamed: 0,original_name,cleaned_name
101,services league ivan venningfinancial modellin...,services league ivan venningfinancial modellin...
106,services maincentral intelligence agency,services maincentral intelligence agency
107,services maindistrict reserve groupabu dhabi n...,services maindistrict reserve groupabu dhabi n...
109,services mainunited statesqueen collegedepartm...,services mainunited statesqueen collegedepartm...
111,services public relationsdisa courtschmidt nat...,services public relationsdisa courtschmidt nat...
112,services public relationsmayo performing arts ...,services public relationsmayo performing arts ...
120,services taxforeign ministry for american affa...,services taxforeign ministry for american affa...
124,services taxjames walker group,services taxjames walker group
127,services taxstag companyassetmark,services taxstag companyassetmark inc
133,services viagraafghan for strategic,services viagraafghan institute for strategic
