![ramen_rj.png](https://github.com/rohinijagath/ramen-reviews/blob/main/ramen_rj.png?raw=true)

# Introduction
The Ramen Rater is a product review website for the hardcore ramen enthusiast (or "ramenphile"), with over 2500 reviews to date. This dataset is an export of "The Big List" (of reviews).
Each record in the dataset is a single ramen product review. Stars indicate the ramen quality, as assessed by the reviewer, on a 5-point scale; this is the most important column in the dataset.

This dataset was republished as-is from the original [BIG LIST](https://www.theramenrater.com/). The data set has since been updated to include reviews up until January 25th 2020 and currently reviews 3545 different ramen products.

# Import Packages

In [1]:
# general imports
import itertools
import re
import time

import numpy as np
import pandas as pd
import pycountry_convert as pc

# visualizations
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
%matplotlib inline

# text processing
from nltk.corpus import stopwords

# preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler



In [3]:
ramen = pd.read_csv('datasets_9366_13206_ramen-ratings.csv')
ramen.head()

| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
| 0 | 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | NaN |
| 1 | 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | NaN |
| 2 | 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | NaN |
| 3 | 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | NaN |
| 4 | 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | NaN |


# Data Description
## Data Structure & Types

The ramen review data comprises 2580 reviews (observations). Each review has the following fields:
- 'Review #' - unique identifier for each ramen product reviewed.
- 'Brand' - the ramen brand.
- 'Variety' - the name of the ramen product, indicative of the type, flavour and other defining characteristics.
- 'Style' - the type of packaging for the ramen product, e.g. cup or pack.
- 'Country' - the country of production or origin.
- 'Stars' - the rating provided by Hans Lienesch on the website https://www.theramenrater.com/.
- 'Top Ten' - indicates whether the product appeared on one of the annual top ten lists.

In [4]:
ramen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2580 entries, 0 to 2579
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Review #  2580 non-null   int64 
 1   Brand     2580 non-null   object
 2   Variety   2580 non-null   object
 3   Style     2578 non-null   object
 4   Country   2580 non-null   object
 5   Stars     2580 non-null   object
 6   Top Ten   41 non-null     object
dtypes: int64(1), object(6)
memory usage: 141.2+ KB


All data types seem appropriate with the exception of 'Stars', which should ideally be numeric rather than object. Let's delve deeper into this ramen bowl.
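One way to find out why a column such as 'Stars' is stored as object is to attempt a numeric conversion and inspect whatever fails to parse. The following is a minimal sketch on a stand-in Series, not the notebook's own cell; the names `stars`, `as_number` and `non_numeric` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the 'Stars' column; the values here are illustrative only.
stars = pd.Series(["3.75", "1", "Unrated", "2.25"], name="Stars")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_number = pd.to_numeric(stars, errors="coerce")

# Entries that became NaN are the ones keeping the column as dtype object.
non_numeric = stars[as_number.isna()]
print(non_numeric.value_counts())
```

Applied to the real column, the same pattern would list every distinct non-numeric label along with how often it occurs.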

## Data Completeness
### Style
The 'Style' is missing for two of the ramen products.

In [6]:
# rows with a missing 'Style' (named so it does not shadow matplotlib.style)
missing_style = ramen[ramen['Style'].isnull()]
missing_style

| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
| 2152 | 428 | Kamfen | E Menm Chicken | NaN | China | 3.75 | NaN |
| 2442 | 138 | Unif | 100 Furong Shrimp | NaN | Taiwan | 3.0 | NaN |


In [7]:
ramen.Style.mode()

0    Pack
dtype: object

The missing styles could be imputed by mode ('Pack'). Incidentally, a check against the updated dataset on theramenrater.com confirms that both products come in 'Pack' form.

In [8]:
# fill null values for style
ramen['Style'] = ramen['Style'].fillna('Pack')

# check for null values 
ramen.isnull().sum()

Review #       0
Brand          0
Variety        0
Style          0
Country        0
Stars          0
Top Ten     2539
dtype: int64
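The fill value does not have to be typed in by hand; it could also be read from the mode of the column itself. The sketch below shows this on a small stand-in frame; `sample` and `most_common` are hypothetical names and the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Stand-in frame with one missing 'Style'; values are illustrative only.
sample = pd.DataFrame({"Style": ["Pack", "Cup", None, "Pack", "Bowl"]})

# mode() returns a Series (ties are possible), so take its first entry.
most_common = sample["Style"].mode().iloc[0]

# Assign the result back instead of using inplace=True on a column selection.
sample["Style"] = sample["Style"].fillna(most_common)
print(sample["Style"].tolist())
```

Deriving the value this way keeps the imputation correct if the underlying data, and therefore the mode, ever changes.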

### Ratings
It appears that three of the ramen products featured in the data set are 'Unrated', hence the object data type.

In [10]:
# create df of unrated products
unrated = ramen[ramen['Stars'] == 'Unrated']
unrated

| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
| 32 | 2548 | Ottogi | Plain Instant Noodle No Soup Included | Pack | South Korea | Unrated | NaN |
| 122 | 2458 | Samyang Foods | Sari Ramen | Pack | South Korea | Unrated | NaN |
| 993 | 1587 | Mi E-Zee | Plain Noodles | Pack | Malaysia | Unrated | NaN |


In [11]:
# filter out unrated ramen products
ramen = ramen[ramen['Stars'] != 'Unrated']

These three entries are dropped, leaving 2577 reviews.
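Once the 'Unrated' rows are gone, every remaining 'Stars' value is a number stored as text, so the column could be cast to float. This is a minimal sketch on a stand-in frame, assuming 'Unrated' is the only non-numeric label; `sample` and `rated` are illustrative names, and the rows are made up for the example.

```python
import pandas as pd

# Stand-in frame; the rows are illustrative only.
sample = pd.DataFrame({
    "Variety": ["Tantanmen", "Sari Ramen", "Singapore Curry"],
    "Stars": ["3.75", "Unrated", "3.75"],
})

# Keep only rated products; .copy() gives an independent frame to modify.
rated = sample[sample["Stars"] != "Unrated"].copy()

# With the text label gone, every remaining value parses as a float.
rated["Stars"] = rated["Stars"].astype(float)
print(rated.dtypes)
```

Taking a `.copy()` after filtering avoids pandas' chained-assignment warnings when the filtered frame is modified afterwards.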

### Top Ten
The 'Top Ten' field indicates if a specific product was awarded a place on the annual top ten list.

In [12]:
# determine values of 'Top Ten'
ramen['Top Ten'].value_counts()

\n          4
2014 #1     1
2015 #4     1
2016 #8     1
2016 #1     1
2015 #9     1
2013 #1     1
2013 #9     1
2012 #7     1
2013 #3     1
2016 #9     1
2013 #2     1
2012 #1     1
2012 #4     1
2014 #8     1
2014 #4     1
2013 #6     1
2015 #7     1
2014 #9     1
2014 #5     1
2013 #4     1
2013 #10    1
2012 #5     1
2014 #6     1
2012 #2     1
2015 #1     1
2015 #6     1
2015 #8     1
2012 #9     1
2016 #7     1
2012 #6     1
2016 #10    1
2014 #7     1
2012 #3     1
2012 #10    1
2015 #10    1
2016 #5     1
2014 #10    1
Name: Top Ten, dtype: int64

In [13]:
ramen['Top Ten'].isnull().sum()

2536

The top ten list was awarded only between 2012 and 2016, and the vast majority of entries are 'NaN' (2536 of 2577), so the column will be dropped from the main dataframe.
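If the award information is wanted later, the 'Top Ten' strings could be split into a year and a rank before the column is removed. The sketch below is an assumption about how that might be done, not a step from the notebook: it uses a stand-in Series, and the names `top_ten`, `parsed` and `awards` are hypothetical.

```python
import pandas as pd

# Stand-in for the 'Top Ten' column, including a NaN and a stray "\n" entry.
top_ten = pd.Series(["2016 #10", "\n", None, "2012 #1"], name="Top Ten")

# Pull out a four-digit year and the rank after '#'; rows that do not match
# the pattern (NaN or the "\n" placeholder) come back as NaN.
parsed = top_ten.str.extract(r"(?P<year>\d{4})\s*#(?P<rank>\d+)")

# Keep only the rows that actually held an award, then convert to integers.
awards = parsed.dropna().astype(int)
print(awards)
```

Because `awards` keeps the original index, it could be joined back to the main dataframe on that index if needed.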

In [14]:
# drop 'Top Ten' from ramen df
ramen = ramen.drop(columns=['Top Ten'])

# drop null values
ramen = ramen.dropna()

In [15]:
ramen.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2577 entries, 0 to 2579
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Review #  2577 non-null   int64 
 1   Brand     2577 non-null   object
 2   Variety   2577 non-null   object
 3   Style     2577 non-null   object
 4   Country   2577 non-null   object
 5   Stars     2577 non-null   object
dtypes: int64(1), object(5)
memory usage: 140.9+ KB


# Feature Engineering
## Ramen Description Length
This feature examines the length of the product description.