# Disney Review Dataset - Cleaning

## Introduction

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns




### Functions

#### Processing Functions

In [2]:
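def get_info(df):
    """Print a structural summary of df, then the null count per column."""
    print(df.info())  # info() prints directly and returns None, so this also prints "None"
    print(df.isna().sum())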

#### Visualization Functions

## Obtain

### Import Dataset

In [3]:
df = pd.read_csv('DisneylandReviews.csv', engine = 'python')
print(df.shape)
df.head()

(42656, 6)


| | Review_ID | Rating | Year_Month | Reviewer_Location | Review_Text | Branch |
|---|---|---|---|---|---|---|
| 0 | 670772142 | 4 | 2019-4 | Australia | If you've ever been to Disneyland anywhere you... | Disneyland_HongKong |
| 1 | 670682799 | 4 | 2019-5 | Philippines | Its been a while since d last time we visit HK... | Disneyland_HongKong |
| 2 | 670623270 | 4 | 2019-4 | United Arab Emirates | Thanks God it wasn t too hot or too humid wh... | Disneyland_HongKong |
| 3 | 670607911 | 4 | 2019-4 | Australia | HK Disneyland is a great compact park. Unfortu... | Disneyland_HongKong |
| 4 | 670607296 | 4 | 2019-4 | United Kingdom | the location is not in the city, took around 1... | Disneyland_HongKong |


## Scrubbing

### Quick Look

In [4]:
get_info(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42656 entries, 0 to 42655
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Review_ID          42656 non-null  int64 
 1   Rating             42656 non-null  int64 
 2   Year_Month         42656 non-null  object
 3   Reviewer_Location  42656 non-null  object
 4   Review_Text        42656 non-null  object
 5   Branch             42656 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.0+ MB
None
Review_ID            0
Rating               0
Year_Month           0
Reviewer_Location    0
Review_Text          0
Branch               0
dtype: int64


> Everything looks fairly normal, except that the `Year_Month` column is stored as a string rather than a datetime object. There are no null values either, which is nice.

### Review_ID

This column is irrelevant to our analysis, so we can get rid of it. First, though, let's check whether there's any weirdness going on.

In [5]:
df['Review_ID'].value_counts()

121586148    2
129231609    2
164830205    2
129207323    2
166787635    2
            ..
127111751    1
359629381    1
609401410    1
184721985    1
322889295    1
Name: Review_ID, Length: 42636, dtype: int64

> Looks like we have some duplicates in here.  Let's get rid of those.

In [6]:
df['Review_ID'].drop_duplicates(keep = 'first', inplace = True)
df['Review_ID'].value_counts()

130078748    1
399349016    1
655889349    1
616889621    1
118578741    1
            ..
278053453    1
639814220    1
126558105    1
639035976    1
322889295    1
Name: Review_ID, Length: 42636, dtype: int64

Okay cool.  Now we can drop the column.

In [7]:
df.drop('Review_ID', axis = 1, inplace = True)
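
One caveat on the deduplication step: `drop_duplicates` called on a single column only affects that Series, so by itself it does not remove rows from the DataFrame. The branch counts further down still sum to 42,656, which suggests the repeated rows stayed in `df`. Below is a minimal sketch of row-level deduplication on a small made-up frame (the `toy` values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: Review_ID 102 appears twice.
toy = pd.DataFrame({
    "Review_ID": [101, 102, 102, 103],
    "Rating": [5, 4, 4, 3],
})

# Deduplicating the column alone yields a shorter Series; `toy` is unchanged.
unique_ids = toy["Review_ID"].drop_duplicates()

# Deduplicating the frame on that column removes the repeated rows themselves.
deduped = toy.drop_duplicates(subset="Review_ID", keep="first").reset_index(drop=True)

print(len(toy), len(unique_ids), len(deduped))
```

If this approach were applied to `df`, it would need to run before `Review_ID` is dropped, since the column is the only key that identifies a repeated review.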

### Rating

Just want to take a quick look through the rating column to make sure everything is on the up and up. 

In [8]:
df['Rating'].value_counts()

5    23146
4    10775
3     5109
2     2127
1     1499
Name: Rating, dtype: int64

> Looks pretty good to me.  Also, most of the ratings are positive...go Disney!
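
To put a number on "most of the ratings are positive", here is a quick sketch using the counts printed above. Treating 4 and 5 stars as "positive" is an assumption made for illustration only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
rating_counts = pd.Series({5: 23146, 4: 10775, 3: 5109, 2: 2127, 1: 1499})

# Assumption: a rating of 4 or 5 counts as a positive review.
positive_share = rating_counts.loc[[4, 5]].sum() / rating_counts.sum()

print(round(positive_share, 3))  # 0.795
```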

### Year_Month

#### Checking Values

In [9]:
df['Year_Month'].value_counts()

missing    2613
2015-8      786
2015-7      759
2015-12     701
2015-6      692
           ... 
2010-8        7
2010-5        4
2019-5        2
2010-3        2
2010-4        1
Name: Year_Month, Length: 112, dtype: int64

> Bunch of missing values here that we'll need to deal with.  Let's see how we can work with this.  

In [10]:
# Checking each value's share of the rows (proportions, not raw counts)

df['Year_Month'].value_counts(normalize=True)

missing    0.061258
2015-8     0.018426
2015-7     0.017794
2015-12    0.016434
2015-6     0.016223
             ...   
2010-8     0.000164
2010-5     0.000094
2019-5     0.000047
2010-3     0.000047
2010-4     0.000023
Name: Year_Month, Length: 112, dtype: float64

The 'missing' values make up a little over 6% of the data, which is too much to simply drop. Let's see if there's a solid way to impute a better value.

In [11]:
# Swap the 'missing' placeholder for a real null in a single vectorized call
df['Year_Month'] = df['Year_Month'].replace('missing', np.nan)

In [12]:
df['Year_Month'].isna().sum()

2613
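
If the placeholder string is known before loading, another option is to have `read_csv` convert it to a null at read time through its `na_values` parameter. A small sketch on an in-memory CSV (the rows are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the Year_Month placeholder.
csv_text = "Rating,Year_Month\n4,2019-4\n5,missing\n3,2015-8\n"

# na_values tells the parser which extra strings to treat as NaN.
sample = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

n_missing = int(sample["Year_Month"].isna().sum())
print(n_missing)  # 1
```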

#### Datetime

Let's change this column from a string to a datetime object in case we want to play around with the date in the future.

In [13]:
import datetime as dt

In [14]:
df['Year_Month'] = pd.to_datetime(df['Year_Month'])

In [15]:
df.head()

| | Rating | Year_Month | Reviewer_Location | Review_Text | Branch |
|---|---|---|---|---|---|
| 0 | 4 | 2019-04-01 | Australia | If you've ever been to Disneyland anywhere you... | Disneyland_HongKong |
| 1 | 4 | 2019-05-01 | Philippines | Its been a while since d last time we visit HK... | Disneyland_HongKong |
| 2 | 4 | 2019-04-01 | United Arab Emirates | Thanks God it wasn t too hot or too humid wh... | Disneyland_HongKong |
| 3 | 4 | 2019-04-01 | Australia | HK Disneyland is a great compact park. Unfortu... | Disneyland_HongKong |
| 4 | 4 | 2019-04-01 | United Kingdom | the location is not in the city, took around 1... | Disneyland_HongKong |


#### Month/Year Columns & Null Values

In [16]:
df['Year'] = df['Year_Month'].dt.year
df['Month'] = df['Year_Month'].dt.month
df.drop('Year_Month', axis = 1, inplace = True)
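
Because the source column contains nulls, `Year` and `Month` come out as floats (the final `df.head()` shows values like 2019.0 and 4.0). If integer columns are preferred, one option is pandas' nullable `Int64` dtype. The sketch below parses made-up `Year_Month` strings directly, as an alternative to going through datetime. It is an illustration, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical sample of raw Year_Month strings, including the placeholder.
year_month = pd.Series(["2019-4", "2015-12", "missing"])

# Split "YYYY-M" into two string columns; 'missing' has no second part.
parts = year_month.str.split("-", expand=True)

# Non-numeric pieces become NaN, then the nullable Int64 dtype keeps them as <NA>.
year = pd.to_numeric(parts[0], errors="coerce").astype("Int64")
month = pd.to_numeric(parts[1], errors="coerce").astype("Int64")

print(year.dtype, month.dtype)
```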

In [18]:
for col in ['Year', 'Month']:
    missing = df[col].isna()
    # One independent draw per missing row, sampled from the observed values only
    df.loc[missing, col] = np.random.choice(df[col].dropna(), size=missing.sum())

In [19]:
df.isna().sum()

Rating               0
Reviewer_Location    0
Review_Text          0
Branch               0
Year                 0
Month                0
dtype: int64
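
The idea behind a random fill can be sketched on a toy Series. Each missing row gets its own draw, the draws come only from observed values (so a null can never be picked as the fill), and the generator is seeded so the result is reproducible. The seed and the sample values here are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical Year column with three missing entries.
years = pd.Series([2015.0, np.nan, 2017.0, np.nan, 2015.0, np.nan])

missing = years.isna()
observed = years.dropna().to_numpy()

# Draw as many values as there are gaps, so frequent years are picked more often.
filled = years.copy()
filled[missing] = rng.choice(observed, size=int(missing.sum()))

print(int(filled.isna().sum()))  # 0
```

Note that drawing `Year` and `Month` independently can produce year-month pairs that never occur together in the data. Sampling whole rows would avoid that, at the cost of slightly more code.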

### Reviewer_Location

Let's go through this column and make sure everything looks okay.

In [21]:
df['Reviewer_Location'].value_counts()

United States          14551
United Kingdom          9751
Australia               4679
Canada                  2235
India                   1511
                       ...  
Antigua and Barbuda        1
Grenada                    1
Senegal                    1
Papua New Guinea           1
Timor-Leste                1
Name: Reviewer_Location, Length: 162, dtype: int64

In [22]:
for i in df['Reviewer_Location'].unique():
    print(i)

Australia
Philippines
United Arab Emirates
United Kingdom
Singapore
India
Malaysia
United States
Canada
Myanmar (Burma)
Hong Kong
China
Indonesia
Qatar
New Zealand
Sri Lanka
Uganda
Thailand
Austria
South Africa
Saudi Arabia
Japan
Israel
South Korea
Turkey
Macau
Egypt
Mexico
Mauritius
Sweden
Brazil
Kenya
Vietnam
Portugal
Cambodia
Zambia
Croatia
France
Taiwan
Oman
Colombia
Norway
Kuwait
Netherlands
Barbados
Finland
Bosnia and Herzegovina
Brunei
Bahrain
Maldives
Ireland
Russia
Romania
Northern Mariana Islands
Germany
Chile
Isle of Man
Pakistan
Ukraine
Greece
Switzerland
Spain
Estonia
C�te d'Ivoire
Guam
Bangladesh
Belgium
Italy
Botswana
Denmark
Argentina
Peru
Lithuania
Iran
Mali
Uruguay
Mongolia
Zimbabwe
Seychelles
Puerto Rico
Hungary
Fiji
Nepal
Jordan
Cyprus
Venezuela
Dominican Republic
Czechia
Bulgaria
Ghana
Ethiopia
The Bahamas
Serbia
Montenegro
Guatemala
Kazakhstan
Poland
Vanuatu
Laos
Cura�ao
Falkland Islands (Islas Malvinas)
Andorra
Haiti
Costa Rica
Nigeria
Jersey
Solomon Islands
Moza

> Other than a few accented characters that don't render correctly (most likely an encoding issue when the file was read), everything seems to be okay in terms of locations.
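
The replacement characters in names such as Côte d'Ivoire and Curaçao are what appears when bytes in a single-byte encoding are decoded as UTF-8. The sketch below reproduces the effect on made-up bytes and shows how passing `encoding` to `read_csv` avoids it. That the real file is Latin-1 encoded is an assumption, not something verified here.

```python
import io
import pandas as pd

# Hypothetical file contents, encoded as Latin-1 (assumed encoding).
raw = "Reviewer_Location\nCôte d'Ivoire\nCuraçao\n".encode("latin-1")

# Decoding those bytes as UTF-8 with replacement yields U+FFFD placeholders.
garbled = raw.decode("utf-8", errors="replace")

# Telling pandas the actual encoding recovers the accented characters.
fixed = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print(fixed["Reviewer_Location"].tolist())
```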

### Branch

Let's just make sure there's no weirdness in this column.

In [23]:
df['Branch'].value_counts()

Disneyland_California    19406
Disneyland_Paris         13630
Disneyland_HongKong       9620
Name: Branch, dtype: int64

> No love for Walt Disney World in Orlando, Tokyo Disneyland or Shanghai Disneyland?  Anyway, things look clean here as well.

### Review_Text

Let's just check out a few reviews to see what we're working with.

In [26]:
df['Review_Text'][0]

"If you've ever been to Disneyland anywhere you'll find Disneyland Hong Kong very similar in the layout when you walk into main street! It has a very familiar feel. One of the rides  its a Small World  is absolutely fabulous and worth doing. The day we visited was fairly hot and relatively busy but the queues moved fairly well. "

In [27]:
df['Review_Text'][12]

'We spend two days, the second day went early then went straight to the back of the park, no lineups for so children got to go on many rides, some twice in a row. This Disneyland is very suitable for young children ours were 7,6,5,4,3 and 1 so most of them could go on all the rides, it was disappointing the castle was closed no nightly fireworks. Would not like to go in the hot season.'

In [28]:
df['Review_Text'][156]

"Last month,my parents,my friends and I went to Hong Kong.We arrived in Hong Kong on February tenth.The weather was cool and sunny.We went to the hotel first,it was clean and the service is good.Almost everything was perfect,the only drawback was the rooms of the hotel were too small that I didn't have enough place to play.After we had a rest ,we went out to a restaurant.The food it's delicious .we had beef balls,roasted goose,shrimp ravioli and some herbal tea.While we were eating,my dad and I saw a store selling iPads next store and since then I had a new iPad.That's really fantastic.We went Disneyland the day after we arrived.We played many recreational events.There were many people there.We waited for a long time to play it.The Roller Coaster was really cool.If there were not too many people,I really wanted to try that again.After we played for a day,every one was tired,we took a small train back to the hotel.We played the rugby,on last day in Hong Kong.It's very interesting.I trie

In [29]:
df['Review_Text'][1000]

"I thought this is the happiest and coolest place on earth, I was wrong. There is limited numbers of rides to choose. they don't offer extreme rides unlike any other theme park. The castle is on going rehabilitation. The parade was good, but if you want and prefer to all rides you will not be satisfied to HK Disneyland."

In [31]:
df['Review_Text'][19000]

"We love coming here. This time it was our wedding anniversary. Everyone had fun helping us celebrate. It was cold rain for most of our trip, but we still found things to do. And now we have nifty ponchos. *sigh*And I got sick, and the First Aide people were very helpful. Don't be afraid to ask for help. They just want to see you feel better and get back to your vacation. Thank you Disney People."

> Aside from some grammar and spacing issues, the reviews look pretty clean.

## Save Clean Dataframe

In [32]:
df.head()

| | Rating | Reviewer_Location | Review_Text | Branch | Year | Month |
|---|---|---|---|---|---|---|
| 0 | 4 | Australia | If you've ever been to Disneyland anywhere you... | Disneyland_HongKong | 2019.0 | 4.0 |
| 1 | 4 | Philippines | Its been a while since d last time we visit HK... | Disneyland_HongKong | 2019.0 | 5.0 |
| 2 | 4 | United Arab Emirates | Thanks God it wasn t too hot or too humid wh... | Disneyland_HongKong | 2019.0 | 4.0 |
| 3 | 4 | Australia | HK Disneyland is a great compact park. Unfortu... | Disneyland_HongKong | 2019.0 | 4.0 |
| 4 | 4 | United Kingdom | the location is not in the city, took around 1... | Disneyland_HongKong | 2019.0 | 4.0 |


In [33]:
df.to_csv('clean_DisneylandReviews.csv')
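
One thing to keep in mind when reloading the saved file: by default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column. A small sketch of the difference, using an in-memory buffer and a made-up two-row frame in place of the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
clean = pd.DataFrame({
    "Rating": [4, 5],
    "Branch": ["Disneyland_HongKong", "Disneyland_Paris"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
clean.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
clean.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Rating', 'Branch']
print(cols_without_index)  # ['Rating', 'Branch']
```

Either form works as long as the loading code matches: pass `index_col=0` when reading a file saved with the index, or save with `index=False`.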