### Prerequisite

Because the raw data file exceeds GitHub's file size limit, it is not included in the repository. Download it from [this SharePoint link](https://365umedumy.sharepoint.com/sites/MLGroupAssignment/_layouts/15/download.aspx?UniqueId=866113220dd843709caabce8602b2fbd&e=btGuHc) and place the downloaded `yelp_academic_dataset_review.json` file in the `data` folder before executing the cells below.
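
As a quick sanity check before running the notebook, the presence of the file can be verified up front. The sketch below is only an illustration: the helper name `raw_data_available` is hypothetical, and the path is assumed to match the relative path passed to `pd.read_json` later in this notebook.

```python
from pathlib import Path

# Assumed location, matching the relative path used when reading the raw data.
RAW_DATA_PATH = Path("../data/yelp_academic_dataset_review.json")


def raw_data_available(path: Path = RAW_DATA_PATH) -> bool:
    """Return True if the raw review file exists and is not empty."""
    return path.is_file() and path.stat().st_size > 0


if not raw_data_available():
    print(f"Missing {RAW_DATA_PATH}; download it before running the cells below.")
```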

### Imports

In [27]:
import numpy as np
import pandas as pd

### Data Collection

#### Read Raw Data

In [2]:
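# Read the JSON Lines file in chunks of 1,000 records, then concatenate all chunks
# in a single call instead of growing the DataFrame inside a loop.
chunks = pd.read_json("../data/yelp_academic_dataset_review.json", orient="records", lines=True, chunksize=1000)
df = pd.concat(chunks)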

df.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
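
If holding every review in memory at once is a concern, the year filter can be applied to each chunk as it is read, so only the matching rows are kept. This is a sketch of that alternative, not the approach used in this notebook: the function name `read_reviews_for_year` is hypothetical, and a three-line in-memory sample stands in for the real file.

```python
import io

import pandas as pd


def read_reviews_for_year(source, year, chunksize=1000):
    """Stream a JSON Lines source and keep only rows whose `date` falls in `year`."""
    kept = []
    for chunk in pd.read_json(source, orient="records", lines=True, chunksize=chunksize):
        # Convert explicitly so the filter does not rely on automatic date parsing.
        dates = pd.to_datetime(chunk["date"])
        kept.append(chunk[dates.dt.year == year])
    return pd.concat(kept)


# Hypothetical sample in the same JSON Lines layout as the Yelp review file.
sample = io.StringIO(
    '{"review_id": "a", "stars": 5, "text": "great", "date": "2022-01-01 15:47:07"}\n'
    '{"review_id": "b", "stars": 1, "text": "bad", "date": "2018-07-07 22:09:11"}\n'
    '{"review_id": "c", "stars": 3, "text": "ok", "date": "2022-01-02 03:49:01"}\n'
)
subset = read_reviews_for_year(sample, 2022, chunksize=2)
print(subset[["review_id", "stars"]])
```

On the real data the call would take the file path instead of the `StringIO` object. The trade-off is that the full DataFrame is never available for `info()` or `describe()`.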


#### Check Summary

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6990280 entries, 0 to 6990279
Data columns (total 9 columns):
 #   Column       Dtype         
---  ------       -----         
 0   review_id    object        
 1   user_id      object        
 2   business_id  object        
 3   stars        int64         
 4   useful       int64         
 5   funny        int64         
 6   cool         int64         
 7   text         object        
 8   date         datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(4)
memory usage: 480.0+ MB


#### Check Descriptive Statistics

In [4]:
df.describe()

| | stars | useful | funny | cool |
|---|---|---|---|---|
| count | 6990280.0 | 6990280.0 | 6990280.0 | 6990280.0 |
| mean | 3.748584 | 1.184609 | 0.3265596 | 0.4986175 |
| std | 1.478705 | 3.253767 | 1.688729 | 2.17246 |
| min | 1.0 | -1.0 | -1.0 | -1.0 |
| 25% | 3.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4.0 | 0.0 | 0.0 | 0.0 |
| 75% | 5.0 | 1.0 | 0.0 | 0.0 |
| max | 5.0 | 1182.0 | 792.0 | 404.0 |


#### Subset Data

Select only the reviews dated in 2022.

In [22]:
df_2022 = df[df["date"].dt.year == 2022]
df_2022.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 430101 | mG1FavLfA5j2L83sCZ3rFg | BLu9dc1uj_MBgR-Ns9bwQg | drTZrkbpSoJgwKETlFbc3w | 1 | 0 | 0 | 0 | I bought a Fender 1966 Telecaster that the sal... | 2022-01-01 15:47:07 |
| 430102 | WMpnr1XBJ5U38rfSdErhJQ | 0w1Cpzqg0LV93LmrWbmZnA | jyxHti29yWdYR00Itt1A2w | 5 | 0 | 0 | 0 | This is our go to for take out when I visit my... | 2022-01-02 03:49:01 |
| 430105 | 99EMi0lRhdmylbG0soaf9w | QcP1iT3zKu7NQmiIlOg6XA | Jo4ei-c-5H53IxZxAVf1jQ | 5 | 0 | 0 | 0 | Danielle did a great job! She listened and cu... | 2022-01-03 03:17:03 |
| 430687 | Qs4z8e7hCoU9EzRKD9rGPQ | zH1VutqglmJPSvShRl07vg | YT5CjacTllBtvMaMJS3IbA | 1 | 0 | 0 | 0 | We saw a lot of roaches in the bathroom when w... | 2022-01-05 15:55:59 |
| 432403 | SXZ2Nw9UGAgPlXJsju9fFA | bvbmmVvkoxzTFzPc89WQhA | 9MHe5jAym2d8VhT_NbCRyw | 2 | 0 | 0 | 0 | We Ordered pork fried rice and beef chow mei ... | 2022-01-06 03:59:21 |


#### Check Row Count

In [23]:
len(df_2022)

31665

#### Check Missing Values

In [24]:
df_2022.isna().sum()

review_id      0
user_id        0
business_id    0
stars          0
useful         0
funny          0
cool           0
text           0
date           0
dtype: int64

No missing values were found in any column.
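
Note that `isna()` only counts `NaN`/`None` entries, so a review whose text is an empty or whitespace-only string would not appear in the counts above. A minimal sketch of an extra check follows; the helper name `count_blank_text` and the four-row sample frame are hypothetical stand-ins for `df_2022`.

```python
import pandas as pd


def count_blank_text(frame: pd.DataFrame, column: str = "text") -> int:
    """Count rows whose text is an empty or whitespace-only string."""
    return int(frame[column].str.strip().eq("").sum())


# Hypothetical sample; on the notebook's data this would be count_blank_text(df_2022).
sample = pd.DataFrame({"text": ["Great food", "", "   ", "Too slow"]})
blank_rows = count_blank_text(sample)
print(blank_rows)
```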

#### Check Unique Values

The unique values of `stars` determine which keys the label mapping must cover.

In [28]:
np.sort(df["stars"].unique())

array([1, 2, 3, 4, 5], dtype=int64)

#### Label Data

In [30]:
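# Work on a copy so the new column is not written to a slice of df.
df_2022_labeled = df_2022.copy()

# Collapse the 1-5 star rating into three sentiment classes.
mapping = {1: "negative", 2: "negative", 3: "neutral", 4: "positive", 5: "positive"}
df_2022_labeled["label"] = df_2022_labeled["stars"].map(mapping)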

df_2022_labeled.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 430101 | mG1FavLfA5j2L83sCZ3rFg | BLu9dc1uj_MBgR-Ns9bwQg | drTZrkbpSoJgwKETlFbc3w | 1 | 0 | 0 | 0 | I bought a Fender 1966 Telecaster that the sal... | 2022-01-01 15:47:07 | negative |
| 430102 | WMpnr1XBJ5U38rfSdErhJQ | 0w1Cpzqg0LV93LmrWbmZnA | jyxHti29yWdYR00Itt1A2w | 5 | 0 | 0 | 0 | This is our go to for take out when I visit my... | 2022-01-02 03:49:01 | positive |
| 430105 | 99EMi0lRhdmylbG0soaf9w | QcP1iT3zKu7NQmiIlOg6XA | Jo4ei-c-5H53IxZxAVf1jQ | 5 | 0 | 0 | 0 | Danielle did a great job! She listened and cu... | 2022-01-03 03:17:03 | positive |
| 430687 | Qs4z8e7hCoU9EzRKD9rGPQ | zH1VutqglmJPSvShRl07vg | YT5CjacTllBtvMaMJS3IbA | 1 | 0 | 0 | 0 | We saw a lot of roaches in the bathroom when w... | 2022-01-05 15:55:59 | negative |
| 432403 | SXZ2Nw9UGAgPlXJsju9fFA | bvbmmVvkoxzTFzPc89WQhA | 9MHe5jAym2d8VhT_NbCRyw | 2 | 0 | 0 | 0 | We Ordered pork fried rice and beef chow mei ... | 2022-01-06 03:59:21 | negative |
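
`Series.map` with a dictionary silently produces `NaN` for any value that has no key, so it can be worth confirming that the mapping covers every star rating before labeling. The sketch below shows one way to do that; the wrapper `label_stars` is a hypothetical helper, and the short series is sample data rather than the real `stars` column.

```python
import pandas as pd

mapping = {1: "negative", 2: "negative", 3: "neutral", 4: "positive", 5: "positive"}


def label_stars(stars: pd.Series, mapping: dict) -> pd.Series:
    """Map star ratings to labels, failing loudly on ratings the mapping does not cover."""
    unmapped = set(stars.unique().tolist()) - set(mapping)
    if unmapped:
        raise ValueError(f"No label defined for star values: {sorted(unmapped)}")
    return stars.map(mapping)


# Hypothetical sample; on the notebook's data this would be df_2022["stars"].
sample_stars = pd.Series([1, 5, 3, 2, 4])
labels = label_stars(sample_stars, mapping)
print(labels.tolist())
```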


#### Check Label Count

In [31]:
df_2022_labeled["label"].value_counts()

positive    20916
negative     8566
neutral      2183
Name: label, dtype: int64

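The classes are imbalanced: of the 31,665 reviews, about 66% are positive (20,916), about 27% are negative (8,566), and about 7% are neutral (2,183). Later modeling steps will need to account for this.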
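
To make the imbalance concrete, the class shares can be computed from the counts above. The sketch also derives inverse-frequency class weights, `n_samples / (n_classes * class_count)`, which is one common heuristic for compensating for imbalance; it is shown here as an assumption for illustration, not as the method this project uses.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
counts = pd.Series({"positive": 20916, "negative": 8566, "neutral": 2183})

# Share of each class in the 2022 subset.
proportions = counts / counts.sum()

# Inverse-frequency weights: rarer classes receive larger weights.
class_weights = counts.sum() / (len(counts) * counts)

print(proportions.round(3))
print(class_weights.round(3))
```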

#### Select Columns

Keep only the `text` and `label` columns, which are the only ones needed downstream.

In [32]:
df_2022_labeled = df_2022_labeled[["text", "label"]]
df_2022_labeled.head()

| | text | label |
|---|---|---|
| 430101 | I bought a Fender 1966 Telecaster that the sal... | negative |
| 430102 | This is our go to for take out when I visit my... | positive |
| 430105 | Danielle did a great job! She listened and cu... | positive |
| 430687 | We saw a lot of roaches in the bathroom when w... | negative |
| 432403 | We Ordered pork fried rice and beef chow mei ... | negative |


#### Export to CSV

In [33]:
df_2022_labeled.to_csv("../data/review_2022.csv", index=False)
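
Review text often contains commas, quotes, and line breaks, so it can be reassuring to confirm that the exported file reads back unchanged. The sketch below performs that round trip on a small hypothetical frame written to a temporary directory; it does not touch `../data/review_2022.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_2022_labeled; the text deliberately contains
# a comma, double quotes, and a newline to exercise CSV quoting.
sample = pd.DataFrame({
    "text": ["Wow! Yummy, different, delicious.", 'He said "never again".\nAgreed.'],
    "label": ["positive", "negative"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "review_sample.csv")
    sample.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

# Compare contents column by column after the round trip.
round_trip_ok = (
    restored["text"].tolist() == sample["text"].tolist()
    and restored["label"].tolist() == sample["label"].tolist()
)
print(round_trip_ok)
```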