<a target="_blank" href="https://colab.research.google.com/github/wbfrench1/barker_DATA606/blob/main/src/NER_w_spaCy.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Draft Proposal

For your project proposal, please use the following questions to guide your writing:

<ol><li> What is your issue of interest (provide sufficient background information)?</li>

  My project will focus on Named Entity Recognition (NER). My interest in this topic stems from the need to classify and organize the large volumes of unstructured data produced every day so that they become usable.
  - Too much data to read and classify
  - Need a way to find data that is of interest
  - NER is one way to help identify the key data in a document 

 <li>Why is this issue important to you and/or to others?
  - Standard NER tools exist, but they are insufficient for specific use cases (e.g., identifying the key components of a movie review).</li>


<li> What questions do you have in mind and would like to answer?</li>
 
 - Can standard NER tools be trained to work on specific data sets?
 - Can a custom NER model outperform a standard NER model that has been trained on specific data sets?
 - How does the performance of trained standard NER compare to the performance of the standard version used on standard documents?
 - What kinds of techniques can I use to improve a given model's performance?
 - Is it possible to validate the NER tools on unlabeled data?
 - What techniques exist for preparing labeled data from an unlabeled set?

<li> Where do you get the data to analyze and help answer your questions (credibility of source, quality of data, size of data, attributes of data, etc.)?</li>

 - labeled movie reviews: 
<ol><li> https://groups.csail.mit.edu/sls/downloads/movie/</li>
<li>https://www.kaggle.com/datasets/Cornell-University/movie-dialog-corpus</li></ol>
 - unlabeled movie reviews:
 <ol><li>https://ai.stanford.edu/~amaas/data/sentiment/</li></ol> 


 <li>What will be your unit of analysis (for example, patient, organization, or country)? Roughly how many units (observations) do you expect to analyze?</li>
 - Units of analysis would be movie reviews
 <li>What variables/measures do you plan to use in your analysis (variables should be tied to the questions in #3)?</li>
 
  - Since this is a classification problem at its core, to gauge the performance of the various NER Models, I will use:
   
   - Precision
   - Recall
   - F-Score
   - Accuracy
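
As a rough illustration of how these four measures could be computed for a token-level tagger, the sketch below scores predicted tags against gold tags in plain Python. The function name `token_metrics` and the toy tag lists are hypothetical, and treating `O` as the negative class is an assumption of this sketch rather than something the proposal specifies.

```python
def token_metrics(gold, pred, outside="O"):
    """Token-level precision, recall, F1 and accuracy for tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    # True positive: an entity tag (anything except `outside`) predicted exactly.
    tp = sum(1 for g, p in pairs if p != outside and g == p)
    n_pred = sum(1 for _, p in pairs if p != outside)  # entity tags predicted
    n_gold = sum(1 for g, _ in pairs if g != outside)  # entity tags in the gold data
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy example (hypothetical predictions, not model output).
gold = ["O", "B-GENRE", "I-GENRE", "O", "B-YEAR", "I-YEAR"]
pred = ["O", "B-GENRE", "O", "O", "B-YEAR", "B-YEAR"]
scores = token_metrics(gold, pred)
print(scores)
```

Token-level scoring is more lenient than entity-level scoring, where a prediction only counts if the whole span and its type match, so the two should not be compared directly.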

<li> What kinds of techniques/models do you plan to use (for example, clustering, NLP, ARIMA, etc.)?</li>
 - NLP techniques and models will be used: spaCy, the Stanford NER tools, and a custom model based on the conditional random field (CRF) model in Text Analytics with Python.
<li> How do you plan to develop/apply ML and how you evaluate/compare the performance of the models?</li>
<li> What outcomes do you intend to achieve (better understanding of problems, tools to help solve problems, predictive analytics with practical applications, etc)?</li>
 - I hope to better understand NLP NER models: how they work, how their performance is measured, how well they perform, and which techniques, if any, I can use to improve their performance.
</ol>



- project business proposal

  - Literature Search
  - Run the Models
  - Show the results
  - Identify several techniques for improvement
  - Implement techniques
  - Look at results
  - Appropriate visualization of results
  
- EDA: for each data set, explain the data.

  - Show the labels
  - Show the titles and counts

In [None]:
import requests
import pandas as pd
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)

In [None]:
url = 'https://groups.csail.mit.edu/sls/downloads/movie/engtest.bio'
r = requests.get(url, allow_redirects=True)
open('engtest.bio', 'wb').write(r.content)

252636

In [None]:
def fetch_data(str_url: str, str_file_name_out: str) -> None:
    '''

    Description:    downloads the dataset found at the given url
                    saves the file to the working directory under the
                       given filename

    Parameters:     str_url: a string containing the url of the data
                    str_file_name_out: a string containing the name the data
                       will be saved under in the working directory

    Return:         None
    '''
    r = requests.get(str_url, allow_redirects=True)
    # stop on HTTP errors instead of saving an error page as data
    r.raise_for_status()
    # the context manager closes the file once the bytes are written
    with open(str_file_name_out, 'wb') as f_out:
        f_out.write(r.content)
    

# Data

## 1. Data Locations and Filenames

In [None]:
str_url_1_train = 'https://groups.csail.mit.edu/sls/downloads/movie/engtrain.bio'
str_file_name_1_train = 'eng_train.bio'

str_url_1_test = 'https://groups.csail.mit.edu/sls/downloads/movie/engtest.bio'
str_file_name_1_test = 'eng_test.bio'

str_url_2_train = 'https://groups.csail.mit.edu/sls/downloads/movie/trivia10k13train.bio'
str_file_name_2_train = 'trivia_train.bio'

str_url_2_test = 'https://groups.csail.mit.edu/sls/downloads/movie/trivia10k13test.bio'
str_file_name_2_test = 'trivia_test.bio'

l_data_urls_and_filenames = [(str_url_1_train, str_file_name_1_train),
                            (str_url_1_test, str_file_name_1_test),
                            (str_url_2_train, str_file_name_2_train),
                            (str_url_2_test, str_file_name_2_test)]

## 2. Data Labeling

In [None]:
def label_data_w_item_number(str_f_name: str) -> None:
    '''

    Description:    prefixes every word in a question with the same question
                       number to facilitate identification of unique
                       questions later
                    overwrites the input file with the numbered lines

    Parameters:     str_f_name: the filename of one of the dataset files; a
                       question number is prepended to each word that is
                       part of the same question

    Return:         None

    '''

    with open(str_f_name) as f:
        lines = f.readlines()

    i = 1
    with open(str_f_name, 'w') as f_out:
        for line in lines:
            if line != '\n':
                # add a label for each question
                line = str(i) + '\t' + line
                f_out.write(line)
            else:
                i += 1
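
To make the effect of the numbering step concrete, here is a minimal in-memory sketch of the same logic applied to a few made-up lines in the tab-separated `label<TAB>word` layout. The helper name `number_items` and the sample lines are hypothetical; the function above works on files instead.

```python
def number_items(lines):
    """Prefix each non-blank line with its item number; blank lines separate items."""
    numbered = []
    item = 1
    for line in lines:
        if line != "\n":
            numbered.append(f"{item}\t{line}")
        else:
            # a blank line closes the current item and is not copied to the output
            item += 1
    return numbered


# Two made-up questions separated by a blank line.
sample = [
    "O\tshow\n",
    "B-GENRE\tcomedies\n",
    "\n",
    "O\tfind\n",
    "B-ACTOR\ttom\n",
    "I-ACTOR\thanks\n",
]
numbered = number_items(sample)
print("".join(numbered), end="")
```

Because the blank separator lines are dropped, running the file-based function a second time on the same file would prefix every line with `1`, so it is best called once per freshly downloaded file.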

## 3. Get Data and Save to Local Drive
## 4. Label Data with Question Numbers

In [None]:
for str_url, str_f_name_out in l_data_urls_and_filenames:
    # 3. get data and save to local drive
    fetch_data(str_url=str_url,
               str_file_name_out=str_f_name_out)
    # 4. label data with question numbers
    label_data_w_item_number(str_f_name=str_f_name_out)

## 5. Data Labeling

In [None]:
with open('engtest.bio') as f:
    lines = f.readlines()

i = 1

with open('entest_update.bio', 'w') as f_out:
    for line in lines:
        if line != '\n':
            line = str(i) + '\t' + line
            f_out.write(line)
        else:
            i += 1

In [None]:
df = pd.read_csv('engtest.bio', sep='\t', names=['label', 'word'])
df.to_csv('engtest.csv', index=False)
df = pd.read_csv('engtest.csv')

In [None]:
df = pd.read_csv('entest_update.bio', sep='\t', names=['review_num', 'label', 'word'])

In [None]:
df.to_csv('engtest_update.bio.csv', index=False)

In [None]:
df = pd.read_csv('engtest_update.bio.csv')

In [None]:
df.head(20)

Unnamed: 0,review_num,label,word
0,1,O,are
1,1,O,there
2,1,O,any
3,1,O,good
4,1,B-GENRE,romantic
5,1,I-GENRE,comedies
6,1,O,out
7,1,B-YEAR,right
8,1,I-YEAR,now
9,2,O,show


In [None]:
df.label.unique()

array(['O', 'B-GENRE', 'I-GENRE', 'B-YEAR', 'I-YEAR', 'B-PLOT', 'I-PLOT',
       'B-RATINGS_AVERAGE', 'I-RATINGS_AVERAGE', 'B-ACTOR', 'I-ACTOR',
       'B-TITLE', 'I-TITLE', 'B-SONG', 'B-CHARACTER', 'B-RATING',
       'I-RATING', 'B-REVIEW', 'B-DIRECTOR', 'I-DIRECTOR', 'I-REVIEW',
       'I-SONG', 'I-CHARACTER', 'B-TRAILER', 'I-TRAILER'], dtype=object)

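B = Beginning (the first token of an entity)
I = Inside (any later token of the same entity)
O = Outside (the token is not part of any entity); the labels above contain no E (End) tag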
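
The tagging scheme can be turned back into whole entities by walking the tokens in order and starting a new span at every `B-` tag. The sketch below is one possible way to do that in plain Python; the function name `bio_to_spans` is hypothetical, and treating an `I-` tag that has no matching `B-` as the start of a new span is an assumption of this sketch.

```python
def bio_to_spans(tags, words):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []

    def flush():
        # store the span collected so far, if any
        if current_words:
            spans.append((current_type, " ".join(current_words)))

    for tag, word in zip(tags, words):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_type):
            # a B- tag, or an I- tag with no matching open span, starts a new span
            flush()
            current_type, current_words = entity, [word]
        elif prefix == "I":
            current_words.append(word)
        else:
            # an O tag closes any open span
            flush()
            current_type, current_words = None, []
    flush()
    return spans


# First query shown in the df.head() output above.
tags = ["O", "O", "O", "O", "B-GENRE", "I-GENRE", "O", "B-YEAR", "I-YEAR"]
words = ["are", "there", "any", "good", "romantic", "comedies", "out", "right", "now"]
spans = bio_to_spans(tags, words)
print(spans)
```

Unlike joining every word that shares a label within a query, this keeps two adjacent entities of the same type apart whenever the second one opens with its own `B-` tag.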

In [None]:
df.label.value_counts()

O                    14929
B-GENRE               1117
I-ACTOR                862
I-TITLE                856
B-ACTOR                812
B-YEAR                 720
I-YEAR                 610
B-TITLE                562
B-RATING               500
I-PLOT                 496
I-DIRECTOR             496
B-PLOT                 491
B-DIRECTOR             456
B-RATINGS_AVERAGE      451
I-RATINGS_AVERAGE      403
I-RATING               226
I-GENRE                222
I-SONG                 119
B-CHARACTER             90
I-CHARACTER             75
B-REVIEW                56
B-SONG                  54
I-REVIEW                45
B-TRAILER               30
I-TRAILER                8
Name: label, dtype: int64
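
These counts show how unbalanced the labels are, which matters when choosing between the evaluation measures listed in the proposal. The short sketch below copies the counts from the output above into a dictionary (the variable names are hypothetical) and computes the share of tokens tagged `O`.

```python
# Counts copied from the value_counts() output above.
label_counts = {
    "O": 14929, "B-GENRE": 1117, "I-ACTOR": 862, "I-TITLE": 856,
    "B-ACTOR": 812, "B-YEAR": 720, "I-YEAR": 610, "B-TITLE": 562,
    "B-RATING": 500, "I-PLOT": 496, "I-DIRECTOR": 496, "B-PLOT": 491,
    "B-DIRECTOR": 456, "B-RATINGS_AVERAGE": 451, "I-RATINGS_AVERAGE": 403,
    "I-RATING": 226, "I-GENRE": 222, "I-SONG": 119, "B-CHARACTER": 90,
    "I-CHARACTER": 75, "B-REVIEW": 56, "B-SONG": 54, "I-REVIEW": 45,
    "B-TRAILER": 30, "I-TRAILER": 8,
}

total_tokens = sum(label_counts.values())
o_share = label_counts["O"] / total_tokens
print(f"{total_tokens} tokens, {o_share:.1%} tagged O")
```

A tagger that predicted `O` for every token would therefore reach roughly 60% token accuracy on this file while finding no entities at all, which is one reason to lean on precision, recall, and F-score rather than accuracy alone.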

In [None]:
df2 = df.copy()

# Separate the label into two data elements: position (B/I/O) and entity label

In [None]:
df2.head()

Unnamed: 0,review_num,label,word
0,1,O,are
1,1,O,there
2,1,O,any
3,1,O,good
4,1,B-GENRE,romantic


In [None]:
df2.loc[df2['label'] == 'O', 'label'] = 'O-O'

In [None]:
df2.head()

Unnamed: 0,review_num,label,word
0,1,O-O,are
1,1,O-O,there
2,1,O-O,any
3,1,O-O,good
4,1,B-GENRE,romantic


In [None]:
df2['pos'] = 'O'

In [None]:
df2['pos'] = [label[0] for label in df2.label.str.split('-')]

In [None]:
df2['label'] = [label[1] for label in df2.label.str.split('-')]
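
The split performed by the cells above can also be written without first rewriting `O` as `O-O`. The following is a sketch on a small made-up frame, assuming pandas is available; the names `toy` and `entity` are hypothetical (the notebook overwrites `label` instead of adding a new column).

```python
import pandas as pd

# Made-up labels in the same format as the dataset.
toy = pd.DataFrame({"label": ["O", "B-GENRE", "I-GENRE", "B-RATINGS_AVERAGE"]})

# str.partition splits at the first "-" and returns three columns:
# the text before the separator, the separator itself, and the text after it.
parts = toy["label"].str.partition("-")
toy["pos"] = parts[0]
# "O" has no separator, so its third part is empty; fill that in with "O".
toy["entity"] = parts[2].where(parts[2] != "", "O")
print(toy)
```

Splitting only at the first hyphen also keeps the result stable if an entity name ever contains a hyphen of its own.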

# 2443 Reviews

In [None]:
df2['review_num'].max()

2443

In [None]:
df2['label'].unique()

array(['O', 'GENRE', 'YEAR', 'PLOT', 'RATINGS_AVERAGE', 'ACTOR', 'TITLE',
       'SONG', 'CHARACTER', 'RATING', 'REVIEW', 'DIRECTOR', 'TRAILER'],
      dtype=object)

In [None]:
df_label = pd.DataFrame(columns= ['label', 'word'])

In [None]:
df_label

Unnamed: 0,label,word


In [None]:
df2.loc[(df2['review_num'] == 2443)]

Unnamed: 0,review_num,label,word,pos
24675,2443,O,what,O
24676,2443,O,s,O
24677,2443,O,the,O
24678,2443,O,title,O
24679,2443,O,of,O
24680,2443,O,the,O
24681,2443,O,movie,O
24682,2443,O,about,O
24683,2443,CHARACTER,captain,B
24684,2443,CHARACTER,jack,I


In [None]:
' '.join(df2.loc[(df2['review_num'] == 2443)
        & (df2['label'] == 'CHARACTER'), 'word'].tolist())

'captain jack sparrow'

In [None]:
df2.loc[df2['label'] == 'CHARACTER', 'review_num'].unique()

array([  11,   46,   64,   91,  139,  169,  179,  186,  192,  202,  219,
        220,  223,  282,  291,  293,  313,  323,  336,  346,  355,  364,
        367,  389,  403,  420,  424,  441,  474,  476,  491,  499,  516,
        543,  548,  550,  559,  561,  583,  590,  592,  603,  611,  621,
        624,  625,  646,  655,  656,  659,  678,  693,  698,  714,  717,
        727,  728,  738,  750,  771,  795,  865,  894,  903,  922,  924,
        950,  953,  959,  967,  968,  974,  999, 1028, 1038, 1045, 2394,
       2401, 2402, 2403, 2404, 2405, 2441, 2442, 2443])

In [None]:
for index in df2.loc[df2['label'] == 'CHARACTER', 'review_num'].unique():
    print(index)
    # join every CHARACTER word in this review into one string
    word = ' '.join(df2.loc[(df2['review_num'] == index)
                             & (df2['label'] == 'CHARACTER'), 'word'].tolist())
    print(word)
    df_label2 = pd.DataFrame({'label': 'CHARACTER', 'word': word}, index=[index])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_label = pd.concat([df_label, df_label2])

11
detective
46
queen elizabeth
64
jason bourne
91
v
139
james
169
donkey
179
shrek
186
princess fiona
192
james bond
202
jake sully
219
todd
220
brandon teena
223
super man
282
dwight mccarthy
291
darth maul
293
ariels seagull
313
harry potter
323
nostradamous
336
the torch
346
harry potter
355
napoleon dynamite
364
harry potter
367
morpheus
389
leo marvin
403
cop federal agent
420
character
424
monty python
441
andy duphrane
474
hermione granger
476
batman
491
jason bourne
499
dorothys
516
harry potter
543
harry potter
548
ariels
550
dr seuss character
559
chucky
561
character
583
killer
590
sherlock holmes
592
old lady
603
tank
611
air bud
621
chucky
624
zombie tiger
625
captain january
646
kermit miss piggy
655
sabrina
656
freddy kruger
659
on what film did samuel l jackson first appears as nick fury
678
michael jackson
693
tiger
698
wyatt earp
714
ferris bueller
717
leonidas
727
langston hughs
728
wooster
738
freddy
750
r2d2
771
jeeves
795
jackie robinson
865
harry potter
894
pope
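
The per-review loop above can also be expressed as a single group-by, which avoids growing a DataFrame one row at a time. This is a sketch on a small made-up frame with the same columns as `df2`, assuming pandas is available; the names `toy` and `characters` are hypothetical.

```python
import pandas as pd

# Made-up rows with the same columns as df2 (after the label split).
toy = pd.DataFrame({
    "review_num": [1, 1, 1, 2, 2, 3],
    "label": ["O", "CHARACTER", "CHARACTER", "CHARACTER", "O", "GENRE"],
    "word": ["about", "jason", "bourne", "shrek", "movie", "comedy"],
})

# Keep CHARACTER tokens, then join the words of each review into one string.
characters = (
    toy.loc[toy["label"] == "CHARACTER"]
    .groupby("review_num")["word"]
    .agg(" ".join)
    .to_frame()
)
characters.insert(0, "label", "CHARACTER")
print(characters)
```

Like the loop, this joins every `CHARACTER` word in a review into one string, so a review that mentions two characters yields a single merged entry; that may be what produced strings such as `kermit miss piggy` in the output above.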

In [None]:
df_label

Unnamed: 0,label,word
11,CHARACTER,detective
46,CHARACTER,queen elizabeth
64,CHARACTER,jason bourne
91,CHARACTER,v
139,CHARACTER,james
169,CHARACTER,donkey
179,CHARACTER,shrek
186,CHARACTER,princess fiona
192,CHARACTER,james bond
202,CHARACTER,jake sully


In [None]:
df_label2 = pd.DataFrame( { 'label': 'CHARACTER', 'word' : word},index=[index])

In [None]:
df_label.value_counts()

label      word                                                        
CHARACTER  harry potter                                                    6
           james bond                                                      3
           indiana jones                                                   2
           monty python                                                    2
           jason bourne                                                    2
           dr evil                                                         2
           character                                                       2
           chucky                                                          2
           old lady                                                        1
           queen elizabeth                                                 1
           princess fiona                                                  1
           popeye doyle                                                    1
    