## Importing Libraries

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy
import re

Loading the spaCy model and reading the data with pandas

In [3]:
nlp=spacy.load('en_core_web_sm')

In [44]:
df=pd.read_csv('IMDB Dataset.csv')

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [46]:
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


### Functions to clean the text: hyperlink removal, square-bracket removal, special-character removal

In [47]:
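def strip_url(text):
    # Remove everything from "http" up to the next whitespace
    return re.sub(r"http\S+", '', text)
def strip_brackets(text):
    # Remove square brackets together with their contents
    return re.sub(r'\[[^]]*\]', '', text)
def strip_spl_char(text, remove_digits=False):
    # Keep letters and whitespace; digits are kept unless remove_digits=True.
    # The upper-case range must be A-Z: "A-z" also lets [ \ ] ^ _ ` through.
    pattern=r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern,'',text)
def clean(text):
    # URLs and bracketed spans go first, while their punctuation is still intact
    text=strip_url(text)
    text=strip_brackets(text)
    text=strip_spl_char(text)
    return text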
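
Character ranges in a pattern like the one used by `strip_spl_char` are easy to get subtly wrong: `A-z` spans ASCII 65 to 122, which also covers `[`, `\`, `]`, `^`, `_` and the backtick, whereas `A-Z` covers only upper-case letters. The short sketch below compares the two spellings on a made-up sample string (the string and the variable names `loose` and `strict` are illustrative assumptions, not part of the notebook).

```python
import re

# Hypothetical sample containing characters that sit between 'Z' and 'a' in ASCII
sample = "snake_case [tag] 2^3"

# "A-z" range: underscores, brackets and carets count as allowed characters
loose = re.sub(r"[^a-zA-z0-9\s]", "", sample)

# "A-Z" range: only letters, digits and whitespace survive
strict = re.sub(r"[^a-zA-Z0-9\s]", "", sample)

print(repr(loose))
print(repr(strict))
```

With the `A-z` spelling the sample passes through unchanged, while the `A-Z` spelling strips the underscore, brackets and caret.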


In [48]:
df['review']=df['review'].apply(clean)

In [49]:
df['review'].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production br br The filmin...
2    I thought this was a wonderful way to spend ti...
3    Basically theres a family where a little boy J...
4    Petter Matteis Love in the Time of Money is a ...
Name: review, dtype: object
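
The `br br` residue in row 1 comes from the `<br />` tags in the raw review: the special-character step removes `<`, `/` and `>` but leaves the letters behind. A minimal sketch of one way to avoid this, assuming the tags are dropped before the other cleaning steps run, is shown below. The helper name `strip_html_breaks` is hypothetical and is not part of the notebook's pipeline, so the outputs shown elsewhere in this notebook do not reflect it.

```python
import re

def strip_html_breaks(text):
    # Replace <br>, <br/> and <br /> (any letter case) with a single space
    return re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)

# Sample fragment modelled on row 1 of the raw data
sample = "A wonderful little production. <br /><br />The filming"
cleaned = strip_html_breaks(sample)
print(cleaned.split())
```

Running this step first would also keep tokens such as a name followed directly by `<br />` from picking up a trailing `br`.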

In [52]:
df=df.loc[:4999]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     5000 non-null   object
 1   sentiment  5000 non-null   object
dtypes: object(2)
memory usage: 78.2+ KB


Working with the first 5,000 reviews. Using spaCy NER, we find the entities in each review that are labelled "GPE" or "PERSON" and collect them into a list of tuples.

The "GPE" entity label covers countries, states, and cities.
The problem statement asks only for cities, so tagging every "GPE" entity as 'City' is an approximation.

Also, "PERSON" entities are mostly not usernames but names of people mentioned in the reviews (usually actors and characters), since a reviewer is highly unlikely to mention their own name.

In [68]:
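list_ner=[]  # collected (entity text, label) tuples across all reviews
def add(text):
    # Run the spaCy pipeline on one review and append its GPE/PERSON entities to list_ner
    doc=nlp(text)
    for ent in doc.ents:
        if ent.label_=='GPE':
            # GPE covers countries, states and cities; all are tagged 'City' here
            list_ner.append((ent.text,'City'))
        elif ent.label_=='PERSON':
            list_ner.append((ent.text,'Person'))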


In [69]:
df['review'].apply(add)

0       None
1       None
2       None
3       None
4       None
        ... 
4995    None
4996    None
4997    None
4998    None
4999    None
Name: review, Length: 5000, dtype: object
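
The column of `None` values above is just the return value of `add`, which is called only for its side effect on `list_ner`. As a rough sketch of an alternative, the label mapping can be written as a pure function that returns its result. To keep the example runnable without a spaCy model, the input pairs below are hand-written stand-ins for `(ent.text, ent.label_)`; `LABEL_MAP` and `map_entities` are hypothetical names, not part of the notebook.

```python
# Maps the spaCy labels of interest to the output labels used in this notebook
LABEL_MAP = {"GPE": "City", "PERSON": "Person"}

def map_entities(pairs):
    # Keep only pairs whose label is in LABEL_MAP, translating the label
    return [(text, LABEL_MAP[label]) for text, label in pairs if label in LABEL_MAP]

# Hand-written stand-ins for (ent.text, ent.label_) from one parsed review
sample_pairs = [
    ("Michael Sheen", "PERSON"),
    ("New York", "GPE"),
    ("1960s", "DATE"),  # ignored: neither GPE nor PERSON
]
mapped = map_entities(sample_pairs)
print(mapped)
```

In the notebook, the pairs would come from `doc.ents`, and the per-review results could be concatenated into a single list.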

In [70]:
list_ner

[('Oz', 'Person'),
 ('Emerald City', 'City'),
 ('Oz', 'Person'),
 ('Oz', 'Person'),
 ('Michael Sheen', 'Person'),
 ('Williams', 'Person'),
 ('Orton', 'Person'),
 ('Halliwell', 'Person'),
 ('Woody Allen', 'Person'),
 ('Scarlet Johanson', 'Person'),
 ('Devil Wears Prada', 'Person'),
 ('Jake', 'Person'),
 ('Jake', 'Person'),
 ('Rambo', 'Person'),
 ('Jake', 'Person'),
 ('Jake', 'City'),
 ('Petter Matteis Love', 'Person'),
 ('Mr Mattei', 'Person'),
 ('Arthur', 'Person'),
 ('New York', 'City'),
 ('Mr Matteis', 'Person'),
 ('Steve Buscemi Rosario Dawson Carol Kane', 'Person'),
 ('Michael Imperioli', 'Person'),
 ('Adrian Grenier', 'Person'),
 ('Mr Mattei', 'Person'),
 ('Paul Lukas', 'Person'),
 ('Bette Davis', 'Person'),
 ('weekYou', 'Person'),
 ('Harvey Keitel', 'City'),
 ('Phil the Alien', 'Person'),
 ('Psycho', 'Person'),
 ('Janet Leigh', 'Person'),
 ('Ralf Moellerbr', 'Person'),
 ('Jack Carver', 'Person'),
 ('Til Schweiger', 'Person'),
 ('Carver', 'City'),
 ('Tils', 'City'),
 ('Udo Kier', 

In [71]:
len(list_ner)

26247

Keeping only the unique elements of the list, without losing their order.
Reason: a name mentioned more than once produces redundant entries.
The list could be pre-processed further, but this quick draft stops here.

In [84]:
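def f7(seq):
    # Order-preserving de-duplication: keep the first occurrence of each item
    seen = set()
    seen_add = seen.add
    # seen_add(x) returns None (falsy), so an unseen x is recorded and kept
    return [x for x in seq if not (x in seen or seen_add(x))]
final=f7(list_ner)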
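
For comparison, the same order-preserving de-duplication can be sketched with `dict.fromkeys`, which relies on dictionaries keeping insertion order (guaranteed from Python 3.7). The function name `unique_in_order` and the sample list are illustrative assumptions; the tuples are taken from the output shown earlier.

```python
def unique_in_order(seq):
    # dict keys are unique and remember first-insertion order
    return list(dict.fromkeys(seq))

sample = [
    ("Oz", "Person"),
    ("Emerald City", "City"),
    ("Oz", "Person"),    # duplicate, dropped
    ("Jake", "Person"),
    ("Jake", "City"),    # same name, different label: kept
]
unique = unique_in_order(sample)
print(unique)
```

Note that de-duplication works on whole tuples, so the same name under two different labels still appears twice.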

In [85]:
len(final)

14492

In [86]:
final

[('Oz', 'Person'),
 ('Emerald City', 'City'),
 ('Michael Sheen', 'Person'),
 ('Williams', 'Person'),
 ('Orton', 'Person'),
 ('Halliwell', 'Person'),
 ('Woody Allen', 'Person'),
 ('Scarlet Johanson', 'Person'),
 ('Devil Wears Prada', 'Person'),
 ('Jake', 'Person'),
 ('Rambo', 'Person'),
 ('Jake', 'City'),
 ('Petter Matteis Love', 'Person'),
 ('Mr Mattei', 'Person'),
 ('Arthur', 'Person'),
 ('New York', 'City'),
 ('Mr Matteis', 'Person'),
 ('Steve Buscemi Rosario Dawson Carol Kane', 'Person'),
 ('Michael Imperioli', 'Person'),
 ('Adrian Grenier', 'Person'),
 ('Paul Lukas', 'Person'),
 ('Bette Davis', 'Person'),
 ('weekYou', 'Person'),
 ('Harvey Keitel', 'City'),
 ('Phil the Alien', 'Person'),
 ('Psycho', 'Person'),
 ('Janet Leigh', 'Person'),
 ('Ralf Moellerbr', 'Person'),
 ('Jack Carver', 'Person'),
 ('Til Schweiger', 'Person'),
 ('Carver', 'City'),
 ('Tils', 'City'),
 ('Udo Kier', 'Person'),
 ('mehehe', 'City'),
 ('Shakespeare', 'Person'),
 ('Rev Bowdler', 'Person'),
 ('george clooney'

### Creating the required DataFrame and writing it to a JSON file in the working directory


In [93]:
data=pd.DataFrame(final,columns=['Name','Entity'])
data

                    Name  Entity
0                     Oz  Person
1           Emerald City    City
2          Michael Sheen  Person
3               Williams  Person
4                  Orton  Person
...                  ...     ...
14487  Lawrence Oliviers  Person
14488                 Li  Person
14489               Tsui  Person
14490         Chris Gore  Person


In [94]:
data.Entity.value_counts()

Person    12900
City       1592
Name: Entity, dtype: int64

In [95]:
# Write one JSON object per line (JSON Lines format) to result.json
data.to_json('result.json', orient='records', lines=True)
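
With `orient='records'` and `lines=True`, the file holds one JSON object per line rather than a single JSON array, so it has to be read back line by line. The sketch below illustrates that layout using only the standard library; the two records and the temporary file name `result_sample.json` are made up for the example and do not touch the real `result.json`.

```python
import json
import os
import tempfile

# Two illustrative records in the same shape as the notebook's output
records = [
    {"Name": "Oz", "Entity": "Person"},
    {"Name": "Emerald City", "Entity": "City"},
]

# Write one JSON object per line (the JSON Lines layout)
path = os.path.join(tempfile.mkdtemp(), "result_sample.json")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back: each non-empty line is parsed independently
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

A file written this way by pandas can likewise be loaded with `pd.read_json(path, lines=True)`.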