# <center><h1>Animal Crossing</h1></center>
Animal Crossing is a social simulation video game series developed and published by Nintendo and created by Katsuya Eguchi and Hisashi Nogami. In Animal Crossing, the player character is a human who lives in a village inhabited by various anthropomorphic animals, carrying out various activities such as fishing, bug catching, and fossil hunting. The series is notable for its open-ended gameplay and extensive use of the video game console's internal clock and calendar to simulate real passage of time.

Since its release, the game has had an astounding worldwide reception. Several popular critics have rated the game highly and described it as 'meditative'. Players reportedly find it calming and enjoy the sense of progress.

![Image](https://venturebeat.com/wp-content/uploads/2020/02/animal-crossing-new-horizons.jpg?fit=578%2C353&strip=all)

This kernel explores some aspects of the game, such as the items and villagers. It also analyzes user and critic reviews, including a sentiment analysis.

This is a work in progress and will be updated in the coming weeks.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# read data
items = pd.read_csv('/kaggle/input/animal-crossing/items.csv')
user_reviews = pd.read_csv('/kaggle/input/animal-crossing/user_reviews.csv')
critic = pd.read_csv('/kaggle/input/animal-crossing/critic.csv')
villagers = pd.read_csv('/kaggle/input/animal-crossing/villagers.csv')

In [None]:
# Standard plotly imports
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

import plotly.express as px
import plotly.io as pio

# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

pd.set_option('display.max_columns',500)

# ITEMS

In [None]:
items.head()

The first thing to note in the items dataset is that the recipe, recipe_id and sources columns are split across multiple rows for the same item. See the examples below for the Acoustic Guitar and Rusted Part items. We need to be careful when aggregating the data.
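
Because one item can span several rows, a plain row count overstates the number of items. The following is a minimal sketch on a hypothetical toy frame (the rows and recipe values are made up for illustration and are not taken from items.csv). It shows the difference between counting rows and counting distinct names:

```python
import pandas as pd

# Hypothetical toy data: one item repeated across rows, one row per recipe ingredient.
toy_items = pd.DataFrame({
    "name": ["Acoustic Guitar", "Acoustic Guitar", "Rusted Part"],
    "category": ["Furniture", "Furniture", "Other"],
    "recipe_id": ["softwood", "iron-nugget", None],
})

row_count = len(toy_items)                # counts every row, duplicates included
item_count = toy_items["name"].nunique()  # counts each item once

# Aggregate on de-duplicated items so repeated rows do not inflate category counts.
per_category = (
    toy_items.drop_duplicates(subset="name")
    .groupby("category")["name"]
    .count()
)
print(row_count, item_count)
print(per_category)
```

Counting on the de-duplicated frame keeps each item from being counted once per recipe row.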


In [None]:
items[items.name == 'Acoustic Guitar'].head()

In [None]:
items[items.name == 'Rusted Part'].head()

In [None]:
items.shape[0], items.name.nunique()

In [None]:
missing_values = items.isna().sum().sort_values(ascending=False)
missing_values = pd.DataFrame({'Feature':missing_values.index, 'Missing Value Count':missing_values.values})

In [None]:
# plot missing values
fig = px.bar(missing_values[::-1], x= 'Missing Value Count', y='Feature', orientation='h',text='Missing Value Count',
             title='Missing Value Count in Features - Items Dataset',template="plotly_dark")
fig.show()

The recipe column has 87% missing values. As mentioned earlier, the recipe is split across rows, which creates duplicate items in the items dataset. The sources column also has ~80% missing values. Without sufficient knowledge of the dataset we cannot fix these missing values, short of building a predictive model just to impute them.
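
The percentages quoted above can be derived directly from the missing-value mask. This is a minimal sketch on a hypothetical toy frame (the values are invented, only the column names mirror the items data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the items data.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "recipe": [np.nan, np.nan, np.nan, 2.0],
    "sources": [np.nan, "Nook", np.nan, np.nan],
})

# isna() gives a boolean mask; its column-wise mean is the fraction of missing values.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Working with percentages rather than raw counts makes it easier to compare columns and to pick a drop threshold.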

In [None]:
# drop the recipe, recipe_id and sources columns
items = items.drop(['recipe','recipe_id','sources'], axis=1)

# drop duplicates 
items = items.drop_duplicates()

In [None]:
items.shape[0], items.name.nunique()

The following are some of the initial findings after looking at the data:

* The raw data consists of 4,565 rows and 13 columns.
* The recipe_id, recipe and sources values for the same item are populated in different rows.
* The above-mentioned columns have more than 80% missing values and have been dropped from the analysis.
* Some columns such as num_id, id, games_id, id_full and image_url are not of much interest.
* There are a total of 4,200 distinct items in the dataset.

The next step is to fill the missing values in the items data. The following methods will be used for imputing missing values in the different columns:

* For the customizable column - fill with 'Non Customizable' when the value is either missing or False. The True values will be replaced with 'Customizable'.
* For the orderable column - the same strategy as above. The column will be populated with 'Orderable' and 'Not Orderable'.
* buy_currency will be filled with a new category, 'None', since these items cannot be bought, and buy_value will be filled with 0.
* sell_value is one quarter of buy_value and will be populated accordingly where sell_value is missing and the buy value is available. Otherwise the value is 0.
* sell_currency is filled with 0 for the missing values.

In [None]:
items['customizable'] = np.where(items['customizable'] == True,'Customizable','Non Customizable')
items['orderable'] = np.where(items['orderable'] == True,'Orderable','Not Orderable')
items['buy_currency'] = items['buy_currency'].fillna('None')
items['buy_value'] = items['buy_value'].fillna(0)
items['sell_value'] = np.where(items['sell_value'].isnull(), items['buy_value']/4,items['sell_value'])

In [None]:
cat_counts = items.groupby('category')['category'].count()
cat_counts = pd.DataFrame({'Category':cat_counts.index,'Count':cat_counts.values})

# <center>What are the different Item Categories?</center>

In [None]:
fig = px.pie(cat_counts, values='Count', names='Category', title='Category Distribution for the Items', 
             template="plotly_dark",height=500)
fig.show()

In [None]:
# customizable counts
customizable = items[items.customizable == 'Customizable'].groupby('category')['category'].count()
non_customizable = items[items.customizable == 'Non Customizable'].groupby('category')['category'].count().sort_values(ascending=False)

# orderable counts
orderable = items[items.orderable == 'Orderable'].groupby('category')['category'].count()
non_orderable = items[items.orderable == 'Not Orderable'].groupby('category')['category'].count().sort_values(ascending=False)

# <center>Can I customize my socks?</center>

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Non Customizable', x=non_customizable.index, y=non_customizable.values,marker_color='cyan'),
    go.Bar(name='Customizable', x=customizable.index, y=customizable.values, marker_color='chartreuse')
])
# Change the bar mode
fig.update_layout(barmode='group',template="plotly_dark",title_text='Customizable and Non-Customizable Category Counts',height=400)
fig.show()

# <center>What can I not order?</center>

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Non Orderable', x=non_orderable.index, y=non_orderable.values,marker_color='cyan'),
    go.Bar(name='Orderable', x=orderable.index, y=orderable.values, marker_color='chartreuse')
])
# Change the bar mode
fig.update_layout(barmode='group',template="plotly_dark",title_text='Orderable and Non-Orderable Category Counts',height=400)
fig.show()

# <center>How much does the Royal Crown Cost?</center>

In [None]:
expensive_items = items[['name','buy_value']].sort_values(by='buy_value', ascending=False).head(10)
cheapest_items = items[items.buy_value > 0][['name','buy_value']].sort_values(by='buy_value').head(10)

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Most Expensive Items', x=expensive_items[::-1].buy_value, y=expensive_items[::-1].name,marker_color='cornsilk',
           orientation='h'),
])
# Change the bar mode
fig.update_layout(template="plotly_dark",title_text='Top 10 Expensive Items',height=400)
fig.show()

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Cheapest Items', x=cheapest_items.buy_value, y=cheapest_items.name,marker_color='coral',
           orientation='h'),
])
# Change the bar mode
fig.update_layout(template="plotly_dark",title_text='Top 10 Cheapest Items',height=400)
fig.show()

# <center>Buying & Selling Values by Category</center>

In [None]:
cat_buy_value = items.groupby('category')['buy_value'].median().sort_values(ascending=False)
cat_sale_value = items.groupby('category')['sell_value'].median().sort_values(ascending=False)

categories = items.category.unique()

In [None]:
fig = go.Figure()
for cats in categories:
    fig.add_trace(go.Violin(x=items['category'][items['category'] == cats],
                            y=items['buy_value'][items['category'] == cats],
                            name=cats,
                            box_visible=False,
                            meanline_visible=False,jitter=0.05))

fig.update_layout(template="plotly_dark",title_text='Buy Value Distribution by Category',height=400)

fig.show()

In [None]:
fig = go.Figure()
for cats in categories:
    fig.add_trace(go.Violin(x=items['category'][items['category'] == cats],
                            y=items['sell_value'][items['category'] == cats],
                            name=cats,
                            box_visible=False,
                            meanline_visible=False,jitter=0.05))

fig.update_layout(template="plotly_dark",title_text='Sell Value Distribution by Category',height=400)

fig.show()

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Category', x=cat_buy_value.index, y=cat_buy_value.values,marker_color='yellow')
])
# Change the bar mode
fig.update_layout(template="plotly_dark",title_text='Buy Value (Median)',height=400)
fig.show()

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Category', x=cat_sale_value.index, y=cat_sale_value.values,marker_color='red')
])
# Change the bar mode
fig.update_layout(template="plotly_dark",title_text='Sale Value (Median)',height=400)
fig.show()

## Conclusions from the Items Dataset:
* There are a total of 21 categories in the dataset. Furniture and Photos are dominant and account for 45% of the data.
* Some categories such as Fruit (6) and Seashells (8) have extremely low counts. Maybe these are special items.
* Only Furniture and Tools are customizable. All the rest are not. Wonder why we can't customize dresses, hats etc.
* Only a few categories, such as furniture, flooring, tops and similar items, can be ordered.
* Some items, like the royal crown in the 'hats' category, are extremely expensive and cost around 1.2M.
* The cheapest items are typically the photos. Some items are free, that is, they can only be found and sold.
* Based on the median values, music is the most expensive category to buy.
* Fossils have the highest sale value.

# VILLAGERS

In [None]:
villagers.head()

In [None]:
missing_values_villagers = villagers.isna().sum().sort_values(ascending=False)
missing_values_villagers = pd.DataFrame({'Feature':missing_values_villagers.index, 'Missing Value Count':missing_values_villagers.values})

In [None]:
# plot missing values
fig = px.bar(missing_values_villagers[::-1], x= 'Missing Value Count', y='Feature', orientation='h',text='Missing Value Count',
             title='Missing Value Count in Features - Villagers Dataset',template="plotly_dark",height=400)
fig.show()

# <center>What are the dominant villager species?</center>

In [None]:
species_count = villagers.groupby('species')['species'].count()
species_count = pd.DataFrame({'Species':species_count.index,'Count':species_count.values})

fig = px.pie(species_count, values='Count', names='Species', title='Species Distribution for the Villagers', 
             template="plotly_dark",height=500)
fig.show()

In [None]:
males = villagers[villagers.gender == 'male'].groupby('species')['species'].count().sort_values(ascending=False)
females = villagers[villagers.gender == 'female'].groupby('species')['species'].count().sort_values(ascending=False)

fig = go.Figure(data=[
    go.Bar(name='Males', x=males[::-1].index, y=males[::-1].values,marker_color='lightskyblue'),
    go.Bar(name='Females', x=females.index, y=females.values,marker_color='lightsalmon')  # plot the female counts, not the male ones
])
# Change the bar mode
fig.update_layout(barmode='group',template="plotly_dark",title_text='Species Counts - Males & Females',height=400)
fig.show()

# <center>Are the villagers cranky?</center>

In [None]:
personality = villagers.groupby('personality')['personality'].count().sort_values()
males = villagers[villagers.gender == 'male'].groupby('personality')['personality'].count().sort_values(ascending = False)
females = villagers[villagers.gender == 'female'].groupby('personality')['personality'].count().sort_values(ascending = False)

fig = go.Figure(data=[
    go.Bar(name='Personality', x=personality[::-1].index, y=personality[::-1].values,marker_color='steelblue')
])
fig.update_layout(barmode='group',template="plotly_dark",title_text='Personality Types for Villagers',height=400)
fig.show()

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Males', x=males[::-1].index, y=males[::-1].values,marker_color='lightskyblue'),
    go.Bar(name='Females', x=females.index, y=females.values,marker_color='lightsalmon')  # plot the female counts, not the male ones
])
fig.update_layout(barmode='group',template="plotly_dark",title_text='Personality - Males and Females',height=400)
fig.show()

### The following are some of the findings for the villagers dataset:
* The dataset is fairly clean with very few missing values.

* There are a total of 390 villagers and 35 species. The species are fairly evenly distributed, both overall and by gender. Surprisingly, there are no females in the lion species. The bulls have no females and the cows have no males, which is to be expected. Frogs have the most males and cats have the most females.

* There are 8 different personality types. They are mostly evenly distributed, though uchi (caring) and smug are on the lower side. There is a clear distinction between the personalities of the males and females. The males are mostly lazy and cranky (my wife would agree) and the females are mostly normal and snooty. Even among the females, the number of caring ones is low. Hmmm...
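
Observations such as "no females in the lion species" can be checked with a cross-tabulation instead of reading them off a chart. This is a minimal sketch on a hypothetical toy frame (the rows are invented, the real check would use the villagers data with its species and gender columns):

```python
import pandas as pd

# Hypothetical toy data standing in for the villagers frame.
toy_villagers = pd.DataFrame({
    "species": ["cat", "cat", "lion", "lion", "cow", "frog"],
    "gender": ["female", "male", "male", "male", "female", "male"],
})

# Rows are species, columns are genders; absent combinations show up as 0.
counts = pd.crosstab(toy_villagers["species"], toy_villagers["gender"])

# Species that have no villagers of a given gender.
no_females = counts.index[counts["female"] == 0].tolist()
no_males = counts.index[counts["male"] == 0].tolist()
print(no_females, no_males)
```

The same table, built with personality in place of species, would support the personality-by-gender comparison.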

# What are the users saying?
This section analyzes the reviews from the users, which is one of the main focuses of the analysis. The following will be covered:
* User review trends (overall and high ratings)
* Sentiment analysis

In [None]:
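# errors='ignore' is deprecated in recent pandas; 'coerce' turns unparseable dates into NaT instead
user_reviews['date'] = pd.to_datetime(user_reviews['date'], format='%Y-%m-%d', errors='coerce')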

In [None]:
user_review_counts = user_reviews.groupby('date')['user_name'].count()

fig = px.line(x=user_review_counts.index, y=user_review_counts.values, range_x=['2020-03-20','2020-05-03'])

fig.update_layout(
    xaxis = dict(title_text = "Date"),
    yaxis = dict(title_text='Review Count'),height=350,title_text='Review Counts')
    
fig.show()

In [None]:
review_grade_count = user_reviews.groupby('grade')['user_name'].count()

fig = go.Figure(data=[
    go.Bar(name='Count', x=review_grade_count[::-1].index, y=review_grade_count[::-1].values,marker_color='red')
])
fig.update_layout(title_text='Review Counts by Score',height=400)
fig.show()

In [None]:
high_rank_trend = user_reviews[user_reviews.grade >= 9].groupby('date')['user_name'].count()

fig = px.line(x=high_rank_trend.index, y=high_rank_trend.values, range_x=['2020-03-20','2020-05-03'])

fig.update_layout(
    xaxis = dict(title_text = "Date"),
    yaxis = dict(title_text='Review Count'),height=400,title_text='Review Counts - Grade >= 9')
    
fig.show()

### Preliminary findings from user reviews:
* There was a spike in the number of reviews on March 24, 2020. This makes sense since the latest version was released on March 20, 2020 and there is always that initial buzz. After that there is a considerable drop, and the count stays consistent apart from a small spike on April 28, 2020.

* The highest rating, 10, accounts for 25% of the data. 38% of the data has the lowest rating, 0.
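
The rating shares quoted above can be computed rather than read off the bar chart. This is a minimal sketch on hypothetical toy grades (the values are invented, in the notebook the column would be `user_reviews['grade']`):

```python
import pandas as pd

# Hypothetical toy grades standing in for user_reviews['grade'].
toy_grades = pd.Series([0, 0, 0, 10, 10, 5, 8, 9], name="grade")

# normalize=True returns fractions instead of raw counts; multiply by 100 for percentages.
grade_share = toy_grades.value_counts(normalize=True).sort_index() * 100
print(grade_share.round(1))
```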

The next step is to deal with the review 'text' column. We will also add a new column called 'is_bad_review'. A review will be considered bad if the rating is below 7 (an arbitrary assumption).

In [None]:
# create a column for bad and good reviews - any review with a grade of 7 or above is good and the rest are bad.
user_reviews['is_bad_review'] = user_reviews['grade'].apply(lambda x: 1 if x < 7 else 0)

In [None]:
# create a separate dataset with the text column and the newly created 'is_bad_review' column.
reviews_df = user_reviews[['text','is_bad_review']].rename(columns={'text':'review'})

The following will be the steps for cleaning the review column:
* Convert to lowercase
* Tokenize text (split text to words) and remove punctuation
* Remove words that contain numbers
* Remove stop words
* Part-of-speech tagging (adjective, noun, verb, etc.)
* Lemmatize text - convert words to root form (eg: texting to text, relaxing to relax)

We will create functions for these steps.

In [None]:
# define functions for cleaning data
from nltk.corpus import wordnet
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def clean_text(text):
    # lower text
    text = text.lower()
    # tokenize text and remove punctuation
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove stop words
    stop = stopwords.words('english')
    text = [x for x in text if x not in stop]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    # pos tag text
    pos_tags = pos_tag(text)
    # lemmatize text
    text = [WordNetLemmatizer().lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    # remove words with only one letter
    text = [t for t in text if len(t) > 1]
    # join all
    text = " ".join(text)
    return(text)

# clean data
reviews_df["review_clean"] = reviews_df["review"].apply(lambda x: clean_text(x))

Now that the review column is clean, we will add a few more columns:
* Number of words
* Number of characters
* Positivity, neutrality and negativity scores, plus a combined (compound) score (based on VADER, an NLTK module for sentiment analysis)

In [None]:
# add character count column
reviews_df["Char_Count"] = reviews_df["review"].apply(lambda x: len(x))

# add number of words column
reviews_df["Word_Count"] = reviews_df["review"].apply(lambda x: len(x.split(" ")))

In [None]:
# add sentiment analysis columns
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
reviews_df["sentiments"] = reviews_df["review"].apply(lambda x: sid.polarity_scores(x))
reviews_df = pd.concat([reviews_df.drop(['sentiments'], axis=1), reviews_df['sentiments'].apply(pd.Series)], axis=1)

In [None]:
reviews_df[reviews_df["Word_Count"] >= 5].sort_values("pos", ascending = False)[["review", "pos"]].head(10)

Some of the highest positive sentiment reviews do correspond to great feedback. The very first record has words like amazing, great, good, fantastic and incredible all in the same review. No wonder it has a positive score of 0.930.

In [None]:
reviews_df[reviews_df["Word_Count"] >= 5].sort_values("neg", ascending = False)[["review", "neg"]].head(10)

The reviews with the highest negative sentiment do indeed contain harsh phrases such as 'Terrible game', 'No Cloud Saving' and 'Disgusting practice'. The one review that talks about contradicting the unfair zeros is given a high negative score. This is wrong: VADER has misunderstood the context here.

In [None]:
import seaborn as sns

for x in [0, 1]:
    subset = reviews_df[reviews_df['is_bad_review'] == x]
    
    # Draw the density plot
    if x == 0:
        label = "Good reviews"
    else:
        label = "Bad reviews"
    # distplot is deprecated in recent seaborn releases; kdeplot draws the same density curve
    sns.kdeplot(subset['compound'], label = label)

The graph shows the distribution of the compound score for good and bad reviews. For the most part, VADER has classified the good reviews as positive, and the bad reviews tend to have a lower compound sentiment score.
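
The visual impression from the density plot can be backed up with a few summary numbers. This is a minimal sketch on hypothetical toy scores (the values are invented, in the notebook the columns live in `reviews_df`). The 0.05 cut-off for a "positive" compound score is a commonly used convention with VADER and is treated here as an assumption:

```python
import pandas as pd

# Hypothetical toy scores standing in for reviews_df.
toy_reviews = pd.DataFrame({
    "is_bad_review": [0, 0, 0, 1, 1, 1],
    "compound": [0.9, 0.7, -0.2, -0.8, -0.4, 0.3],
})

# Mean and median compound score for good (0) and bad (1) reviews.
summary = toy_reviews.groupby("is_bad_review")["compound"].agg(["mean", "median"])

# Share of reviews in each class whose compound score is at or above the assumed 0.05 cut-off.
positive_share = (
    toy_reviews.assign(positive=toy_reviews["compound"] >= 0.05)
    .groupby("is_bad_review")["positive"]
    .mean()
)
print(summary)
print(positive_share)
```

If the sentiment score tracks the grades, the good reviews should show a higher median and a larger positive share than the bad ones.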

## Coming up - 
Critic reviews text and sentiment analysis