# Ramen Dataset

![naruto](https://i.pinimg.com/736x/03/30/30/033030e076d4bdf77d4e69750cb21918.jpg)

Ramen is a Japanese noodle soup. It consists of Chinese wheat noodles served in a meat- or fish-based broth, often flavored with soy sauce or miso, and topped with ingredients such as sliced pork, nori, menma, and scallions.

## Objectives

- Change columns to correct data types
- Clean rows/columns with missing values
- Group by country, brand, etc.
- Create informational graphs based on grouping
- Generate WordCloud(s)

# Libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

plt.style.use('ggplot')

# General Overview of Dataset

In [None]:
df = pd.read_csv('/kaggle/input/ramen-ratings/ramen-ratings.csv')

In [None]:
df.info()

In [None]:
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns')

In [None]:
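null_df = pd.DataFrame(df.isnull().sum())
# The mean of a boolean mask is the fraction missing; multiply by 100 so the '%' column really is a percentage
nullpct_df = pd.DataFrame(df.isnull().mean() * 100)

na_df_stats = null_df.merge(nullpct_df, left_index= True, right_index= True)
na_df_stats.columns = ['NA count', '%']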

In [None]:
na_df_stats
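
The missing-value summary above can also be built in one step. The following is a minimal sketch, not the notebook's own method: the helper name `missing_value_summary` and the toy data are made up for illustration.

```python
import pandas as pd

def missing_value_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the NA count and NA percentage for every column."""
    na_count = frame.isna().sum()
    # The mean of a boolean mask is the fraction of missing rows
    na_percent = frame.isna().mean() * 100
    return pd.DataFrame({'NA count': na_count, 'NA %': na_percent})

# Toy data, invented for illustration only (not taken from the ramen dataset)
toy = pd.DataFrame({'Style': ['Pack', None, 'Cup', 'Bowl'],
                    'Stars': [3.5, 4.0, None, None]})
summary = missing_value_summary(toy)
print(summary)
```

Building both columns from the same `isna()` mask keeps the count and the percentage consistent and avoids the index merge.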

# Cleaning and Converting to Correct Dtypes

In [None]:
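# Drop rows with a missing Style, then remove the 'Review #' column.
# Reassigning instead of using inplace=True avoids modifying a slice of the original frame.
df = df.loc[df['Style'].notna(), :].drop(columns='Review #')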

In [None]:
df['Stars'] = pd.to_numeric(df.Stars, errors= 'coerce')
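
To make the effect of `errors='coerce'` concrete, here is a small sketch on made-up values. The string `'Unrated'` is a hypothetical example of a non-numeric entry, not a value confirmed from the dataset.

```python
import pandas as pd

# Hypothetical raw ratings stored as strings; 'Unrated' stands in for any non-numeric entry
raw_stars = pd.Series(['3.75', '5', 'Unrated', '0.5'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
stars = pd.to_numeric(raw_stars, errors='coerce')

# Count how many entries were lost in the conversion
n_unparsed = int(stars.isna().sum() - raw_stars.isna().sum())
print(stars.dtype, n_unparsed)
```

Checking the number of coerced values is a cheap way to confirm that the conversion did not silently discard a large share of the ratings.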

In [None]:
print(f'The mean of ramen ratings is: {np.mean(df.Stars)}')
print(f'The standard deviation of ramen ratings is: {np.std(df.Stars)}')

In [None]:
plt.figure(figsize=(15,8))
plt.title('Number of Unique Values in Object Columns')
objects = df.select_dtypes(object).drop(columns='Variety').nunique()
objects.plot(kind='bar')

In [None]:
df.groupby('Country', as_index= False).agg({'Stars':'mean'}).sort_values(by= 'Stars', ascending= False)

In [None]:
df.groupby('Style', as_index= False)['Stars'].mean()

In [None]:
df.Country.value_counts().tail()

In [None]:
df.Brand.value_counts().tail()

# Creating a Balanced Dataset
- We want to convert brands and countries that appear fewer than a certain number of times in the dataset to 'other', or drop them completely
- Here, I arbitrarily chose a threshold of 50
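
The bucketing idea can be sketched as a small reusable helper. This is an illustrative sketch under stated assumptions: the function name `collapse_rare`, its parameters, and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def collapse_rare(values: pd.Series, min_count: int, other_label: str = 'other') -> pd.Series:
    """Replace categories that occur min_count times or fewer with other_label."""
    counts = values.value_counts()
    frequent = counts[counts > min_count].index
    # Keep values in the frequent set; everything else becomes other_label
    return values.where(values.isin(frequent), other_label)

# Toy data, invented for illustration: 'A' appears 3 times, 'B' twice, 'C' once
toy_brands = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
collapsed = collapse_rare(toy_brands, min_count=1)
print(collapsed.tolist())
```

Because the threshold is a parameter, the same helper could be applied to both brands and countries with different cutoffs.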

In [None]:
brand_counts = df.Brand.value_counts()
brand_counts = brand_counts[brand_counts > 50]

# Names of the brands that appear more than 50 times
list1 = brand_counts.index.tolist()

In [None]:
df1_1 = df.copy()

# Convert brands with 50 or fewer entries into 'other'
df1_1['Brand'] = df1_1['Brand'].where(df1_1['Brand'].isin(list1), 'other')

# Exploring the data through plots

In [None]:
df1_1_groupby = df1_1.groupby('Brand', as_index= False)['Stars'].mean().sort_values(ascending=False, by= 'Stars')

plt.figure(figsize=(15,8))
plt.title('Average Brand Rating')
plt.xticks(rotation= 25, fontsize=12)
sns.barplot(data=df1_1_groupby, x='Brand', y='Stars', palette= 'rocket')

In [None]:
df3 = df1_1.copy()

In [None]:
print('Here we see that the style of ramen is mainly divided into 4 types: Pack, Bowl, Cup, and Tray. Box, Bar, and Can have very little representation in the data, so we will ignore them for now.')
df3.Style.value_counts()

In [None]:
plt.figure(figsize=(15,8))
plt.title('Ramen Style Rating Density')
# Filter with df3's own Style column and draw one KDE curve per style
for style, color in [('Pack', 'yellow'), ('Bowl', 'red'), ('Cup', 'blue'), ('Tray', 'green')]:
    sns.kdeplot(df3.loc[df3['Style'] == style, 'Stars'], color=color, label=style)
plt.legend()  # without this call the labels are never displayed

In [None]:
#Convert 'United States' value to 'USA'
df1_1.loc[df1_1.Country == 'United States', 'Country'] = 'USA'

In [None]:
#Find list that has countries with more than 40 reviews
df_country = df1_1.Country.value_counts()
df_country = df_country[df_country > 40]
df_country = df_country.index

df2 = df1_1.loc[df1_1['Country'].isin(df_country), :]

styles = ['Pack', 'Bowl', 'Cup','Tray']
df2 = df2.loc[df2.Style.isin(styles), :]

In [None]:
df2.Country.value_counts()

In [None]:
# Use grouped (not stacked) bars: stacking averages would add them into a total with no meaning
df2.groupby(['Country', 'Style'])['Stars'].mean().unstack().plot(kind= 'bar', figsize=(15,10))
plt.xticks(rotation= 25, fontsize=12)
plt.title('Average Star Rating By Style Per Country')

# Exploring The 'Top Ten' Column

In [None]:
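# Keep only rows with a real 'Top Ten' entry (not null and not a bare newline).
# .copy() makes df3_1 an independent frame, so adding columns later does not raise SettingWithCopyWarning.
top_ten_mask = df2['Top Ten'].notnull() & (df2['Top Ten'] != '\n')
df3_1 = df2.loc[top_ten_mask, :].copy()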

In [None]:
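# Split each 'Top Ten' entry on '#' into year and ranking, stripping stray whitespace from both parts
top_ten_parts = df3_1['Top Ten'].str.split('#', expand=True)
df3_1['Year'] = top_ten_parts[0].str.strip()
df3_1['Ranking'] = top_ten_parts[1].str.strip()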
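
A stricter alternative to splitting on `#` is to validate each entry with a regular expression. The sketch below assumes entries look like `'2016 #10'` (a four-digit year, then `#`, then the rank). That format is inferred from the split above, and the name `parse_top_ten` is hypothetical.

```python
import re

# Assumed format: four-digit year, optional spaces, '#', then the rank
TOP_TEN_PATTERN = re.compile(r'^\s*(\d{4})\s*#\s*(\d+)\s*$')

def parse_top_ten(entry):
    """Return (year, rank) as integers, or None if the entry does not match the assumed format."""
    match = TOP_TEN_PATTERN.match(entry)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_top_ten('2016 #10'))
print(parse_top_ten('\n'))
```

Returning `None` for malformed entries, such as a bare newline, lets the filtering and the parsing happen in a single pass, and the integer results can be sorted or compared numerically.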

In [None]:
df3_1

In [None]:
pie = df3_1.loc[df3_1['Ranking'] == '1', :]
pie_grp = pie.Country.value_counts()

plt.figure(figsize=(15,10))
plt.title('Where #1 Rated Ramen Brands Are From')
pie_grp.plot.pie(textprops={'fontsize': 13}, shadow= True)

In [None]:
pie2 = df3_1.Country.value_counts()

plt.figure(figsize=(15,10))
plt.title('Where Top Ten Rated Ramen Brands Are From')
pie2.plot.pie(textprops={'fontsize': 13}, shadow= True)

In [None]:
plt.figure(figsize=(20,10))
plt.yticks(fontsize=18)
plt.xticks(fontsize=18)
plt.title('Count of Ramen Brands in Top Ten')
sns.countplot(data= df3_1, y= 'Brand', order=df3_1.Brand.value_counts().index, palette= 'rocket')

In [None]:
plt.figure(figsize=(15,5))
plt.yticks(fontsize=14)
plt.xticks(fontsize=14)
plt.title('Count of Ramen Styles in Top Ten')
sns.countplot(data= df3_1, y= 'Style', palette='rocket')

# Wordcloud

In [None]:
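# Custom function to combine the text of a column into one space-separated string
def get_text(column):
    # Join with spaces so the last word of one entry does not fuse with the first word of the next
    return ' '.join(column.astype(str))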

#### Top Ten Rated Ramen Variety Word Cloud

In [None]:
text1 = get_text(df3_1['Variety'])

stopwords = set(STOPWORDS)
wc = WordCloud(background_color= 'black', stopwords= stopwords,
              width=1600, height=800)

wc.generate(text1)
plt.figure(figsize=(20,10), facecolor='k')
plt.axis('off')
plt.tight_layout(pad=0)
plt.imshow(wc)
plt.show()

#### All Ramen Variety Word Cloud

In [None]:
text2 = get_text(df2['Variety'])

stopwords = set(STOPWORDS)
wc = WordCloud(background_color= 'black', stopwords= stopwords,
              width=1600, height=800)

wc.generate(text2)  # use the text from all varieties, not the top-ten text
plt.figure(figsize=(20,10), facecolor='k')
plt.axis('off')
plt.tight_layout(pad=0)
plt.imshow(wc)
plt.show()

# Analysis

- Singapore, Malaysia, and Indonesia have the most ramen brands awarded a top ten ranking from 2010-2016
- The top brands of ramen across all countries in the dataset are Indomie and Samyang Foods
- Japan and South Korea have the highest rated ramen overall (min 40 reviews)
- In contrast, Canada and the UK have the lowest rated (min 40 reviews)
- The two word clouds produced similar results
- The top-rated style of ramen is Pack

# Conclusion

As a current graduate student, I know all too well the struggles of late-night studying and ramen devouring. It was fun exploring a product I've become so accustomed to.

This dataset could be improved with more features to examine. With more features and more entries, fitting a GLM or another regression model could lead to some fascinating results on how consumers rate their ramen.

Any feedback on how I can improve would be greatly appreciated. This is the first notebook I've uploaded to Kaggle, and I hope it helps someone out there.