# Pandas for Data Analysis: Practicing Your Data Analysis Skills

## Outline:

* [Dataset 1: Marvel Comics](#Dataset-1:-Marvel-Comics)
* [Dataset 2: PM 2.5 in Bangkok](#Dataset-2:-PM-2.5-in-Bangkok)
* [Dataset 3: Craft Beers](#Dataset-3:-Craft-Beers)
* [Dataset 4: Nutrition Facts for McDonald's Menu](#Dataset-4:-Nutrition-Facts-for-McDonald's-Menu)
* [Dataset 5: Shelter Animal Outcomes](#Dataset-5:-Shelter-Animal-Outcomes)

## Dataset 1: Marvel Comics

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/marvel-wikia-data.csv')

In [None]:
df.head()

Drop the columns that don't contribute to the goals of the challenge.

In [None]:
to_drop = ['GSM','urlslug', 'page_id', 'EYE', 'HAIR', 'ID']
df_cleaned = df.drop(to_drop, axis='columns')
df_cleaned.head()

In [None]:
# Lowercase all column names using the vectorized string accessor on the Index
df_cleaned.columns = df_cleaned.columns.str.lower()
df_cleaned.head()

Fill missing values: `appearances` with 1, and `align` and `sex` with 'Unknown'.

In [None]:
df_cleaned['appearances'] = df_cleaned['appearances'].fillna(1)
df_cleaned['align'] = df_cleaned['align'].fillna('Unknown')
df_cleaned['sex'] = df_cleaned['sex'].fillna('Unknown')
df_cleaned.info()
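
The same fill can be written as a single `fillna` call with a dictionary that maps each column to its replacement value. The sketch below runs on a small made-up frame (`sample` is hypothetical, not the Marvel data) so it is runnable on its own.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned with a few missing values
sample = pd.DataFrame({
    'appearances': [10.0, None],
    'align': ['Good Characters', None],
    'sex': [None, 'Male Characters'],
})

# One call, one replacement value per column
filled = sample.fillna({'appearances': 1, 'align': 'Unknown', 'sex': 'Unknown'})
print(filled)
```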

Remove the suffix " Characters" from `alive`, `sex` and `align` (for example, "Good Characters" becomes "Good").

In [None]:
df_cleaned['alive'] = df_cleaned['alive'].str.replace(' Characters', '')
df_cleaned['sex'] = df_cleaned['sex'].str.replace(' Characters', '')
df_cleaned['align'] = df_cleaned['align'].str.replace(' Characters', '')
df_cleaned.head()
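
Because the same replacement is applied to three columns, a loop keeps the cell short. This is a minimal sketch on a made-up frame (`sample` is hypothetical); `regex=False` makes the replacement a plain substring match regardless of the pandas version.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned
sample = pd.DataFrame({
    'alive': ['Living Characters', 'Deceased Characters'],
    'sex': ['Male Characters', 'Female Characters'],
    'align': ['Good Characters', 'Unknown'],
})

for column in ['alive', 'sex', 'align']:
    # Plain substring replacement; values without the suffix are left unchanged
    sample[column] = sample[column].str.replace(' Characters', '', regex=False)

print(sample)
```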

### Most popular characters based on the number of appearances over the years?

In [None]:
%matplotlib inline

In [None]:
newdf = df_cleaned.sort_values(by=['appearances'], ascending=False).head(10)
newdf.plot(kind='bar', x='name', y='appearances')
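
An alternative to sorting the whole frame and taking the head is `DataFrame.nlargest`, which selects the top rows directly. The sketch below uses a tiny made-up frame (`sample` is hypothetical) rather than the Marvel data.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned
sample = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'appearances': [5, 40, 12, 40],
})

# Top 2 rows by appearances, largest first; ties keep their original order
top = sample.nlargest(2, 'appearances')
print(top)
```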

### Years with most and least new characters?

In [None]:
import seaborn as sns

Find the years in which the most and the fewest new Marvel characters were introduced. Expect the two years to be far apart.

In [None]:
new = df_cleaned.groupby(df_cleaned['year'])['name'].count().reset_index()
new

In [None]:
min_new_chars = new.sort_values(by=['name'])['year'].values[0]
max_new_chars = new.sort_values(by=['name'], ascending=False)['year'].values[0]

In [None]:
print('Year with the most new characters', int(max_new_chars), 'and the year with the least new characters', int(min_new_chars))
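
If the result is needed as a `(max_year, min_year)` tuple, `idxmax` and `idxmin` on the per-year counts give it without sorting twice. This is a minimal sketch; the function name `most_and_least_new_characters` and the `sample` frame are assumptions made for illustration.

```python
import pandas as pd

def most_and_least_new_characters(df):
    """Return (max_year, min_year) by number of characters introduced."""
    # Rows with a missing year are dropped by groupby
    per_year = df.groupby('year')['name'].count()
    # idxmax/idxmin return the index label (the year), not the count
    return int(per_year.idxmax()), int(per_year.idxmin())

# Hypothetical stand-in for df_cleaned
sample = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'year': [1963.0, 1963.0, 1963.0, 1975.0, 1975.0, 1990.0],
})

max_year, min_year = most_and_least_new_characters(sample)
print(max_year, min_year)
```

When several years tie for the maximum or minimum, `idxmax` and `idxmin` return the first one in index order.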

Plot a bar chart of character introductions per year.

In [None]:
import matplotlib.pyplot as plt

_, ax = plt.subplots(1, 1, figsize=(15, 6))
g = sns.barplot(x='year', y='name', data=new, ax=ax)
g.set_xticklabels(g.get_xticklabels(), rotation=90);

### Percentage of female characters?

In [None]:
sex = df_cleaned.groupby(by=df_cleaned['sex'])['name'].count().reset_index(name='count')
sex['percent'] = sex['count'] / sex['count'].sum() * 100
sex['percent'] = sex['percent'].round(2)
sex = sex.set_index('sex')
percentagefemale = sex.at['Female', 'percent']

In [None]:
print(f'Percentage of female characters {percentagefemale}%')
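
A shorter route to the same kind of percentage is `value_counts(normalize=True)`, which returns the share of each value directly. Note that it counts rows, whereas the groupby above counts non-missing names, so the two can differ slightly if some names are missing. The `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned
sample = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Unknown', 'Male', 'Female', 'Male', 'Male'],
})

# Share of each category, expressed as a percentage
percent_by_sex = (sample['sex'].value_counts(normalize=True) * 100).round(2)
percent_female = percent_by_sex['Female']
print(f'Percentage of female characters {percent_female}%')
```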

### Good vs. bad characters?

In [None]:
# Group by alignment and sex and do a count for each group
goodvbad = df_cleaned.groupby(by=['align', 'sex'])['name'].count().reset_index(name='count')

goodvbad['percent'] = goodvbad['count'] / goodvbad['count'].sum() * 100
goodvbad['percent'] = goodvbad['percent'].round(2)
goodvbad

In [None]:
# Join the two labels with a separator so they stay readable on the chart axis
goodvbad['alignsex'] = goodvbad['align'] + ' / ' + goodvbad['sex']
goodvbad

Create a chart of the distribution of alignment and sex.

In [None]:
sns.barplot(x='percent', y='alignsex', data=goodvbad)
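
The `goodvbad` table can also be viewed as a two-way table with `pd.crosstab`, where `normalize='all'` divides every cell by the grand total. The sketch below assumes nothing about the real data and uses a made-up `sample` frame.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned
sample = pd.DataFrame({
    'align': ['Good', 'Bad', 'Bad', 'Neutral', 'Good', 'Bad', 'Good', 'Bad'],
    'sex': ['Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male'],
})

# Rows are alignments, columns are sexes, cells are percentages of all characters
table = (pd.crosstab(sample['align'], sample['sex'], normalize='all') * 100).round(2)
print(table)
```

Combinations that never occur show up as 0 instead of being absent, which the groupby version does not do.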

---

## Dataset 2: PM 2.5 in Bangkok

Data from http://berkeleyearth.lbl.gov/air-quality/local/Thailand/Bangkok/Bangkok

In [None]:
columns = ['Year', 'Month', 'Day', 'UTC Hour', 'PM2.5', 'PM10_mask', 'Retrospective']
df = pd.read_csv('data/Bangkok.txt', sep='\t', skiprows=range(0, 10), header=None, names=columns)

In [None]:
df.head()
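
A natural next step for hourly readings like these is to combine the date parts into one timestamp and aggregate by day. The sketch below is an assumption about how one might proceed, not part of the original analysis; it uses a made-up `sample` frame with the same column names as above.

```python
import pandas as pd

# Hypothetical stand-in for the Bangkok frame (values are invented)
sample = pd.DataFrame({
    'Year': [2019, 2019, 2019, 2019],
    'Month': [1, 1, 1, 1],
    'Day': [1, 1, 2, 2],
    'UTC Hour': [0, 1, 0, 1],
    'PM2.5': [30.0, 50.0, 20.0, 40.0],
})

# pd.to_datetime can assemble timestamps from columns named year/month/day/hour
parts = sample[['Year', 'Month', 'Day', 'UTC Hour']].rename(
    columns={'Year': 'year', 'Month': 'month', 'Day': 'day', 'UTC Hour': 'hour'}
)
sample['timestamp'] = pd.to_datetime(parts)

# Average the hourly readings within each calendar day (UTC)
daily_mean = sample.set_index('timestamp')['PM2.5'].resample('D').mean()
print(daily_mean)
```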

## Dataset 3: Craft Beers

https://www.kaggle.com/nickhould/craft-cans

* How many breweries are there in each state?
* Which city has the most breweries?
* What is the average alcohol by volume of the beers brewed in each state?
* What is the most commonly brewed beer?
* Which city brews the strongest beers?
* What is the most popular beer in North Dakota?

**Note:**
* `abv`: The alcoholic content by volume with 0 being no alcohol and 1 being pure alcohol
* `ibu`: International bittering units, which describe how bitter a drink is
* `name`: Name of the beer
* `style`: Beer style (lager, ale, IPA, etc.)
* `ounces`: Size of beer in ounces

In [None]:
import pandas as pd

In [None]:
beers = pd.read_csv('data/beers.csv', index_col=0)

In [None]:
beers.head()

In [None]:
breweries = pd.read_csv('data/breweries.csv', index_col=0)

In [None]:
breweries.head()

In [None]:
breweries['id'] = breweries.index

In [None]:
breweries.head(2)

In [None]:
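# Both tables have 'id' and 'name' columns, so label them explicitly instead of the default _x/_y
df = pd.merge(beers, breweries, left_on='brewery_id', right_on='id', suffixes=('_beer', '_brewery'))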

In [None]:
df.head()
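
Some of the questions listed at the top of this section can be approached with `groupby` and `value_counts` on the merged table. The sketch below is one possible reading of them, run on tiny made-up `beers` and `breweries` frames; the rows, the `merged` name, and the choice to interpret "most commonly brewed beer" as the most common `style` and "strongest" as the highest single `abv` are all assumptions.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
breweries = pd.DataFrame({
    'name': ['Alpha Brewing', 'Beta Ales', 'Gamma Beer Co.'],
    'city': ['Portland', 'Portland', 'Denver'],
    'state': ['OR', 'OR', 'CO'],
})
breweries['id'] = breweries.index

beers = pd.DataFrame({
    'name': ['Hop One', 'Dark Two', 'Light Three', 'Hop Four'],
    'style': ['American IPA', 'Stout', 'American IPA', 'American IPA'],
    'abv': [0.07, 0.06, 0.05, 0.08],
    'brewery_id': [0, 0, 1, 2],
})

merged = pd.merge(beers, breweries, left_on='brewery_id', right_on='id',
                  suffixes=('_beer', '_brewery'))

# Number of breweries in each state (counted on the breweries table, not the merged one)
breweries_per_state = breweries.groupby('state')['name'].nunique()
# City with the most breweries
city_with_most_breweries = breweries['city'].value_counts().idxmax()
# Average alcohol by volume of the beers brewed in each state
mean_abv_per_state = merged.groupby('state')['abv'].mean()
# Most common beer style
most_common_style = merged['style'].value_counts().idxmax()
# City whose strongest beer has the highest abv
strongest_city = merged.groupby('city')['abv'].max().idxmax()

print(breweries_per_state, city_with_most_breweries, most_common_style, strongest_city)
```

Counting breweries on the merged table instead would count beers, since each brewery appears once per beer after the merge.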

## Dataset 4: Nutrition Facts for McDonald's Menu

https://www.kaggle.com/mcdonalds/nutrition-facts

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/mcdonald-menu.csv')

In [None]:
df.head()

In [None]:
# Item with the most calories
df.loc[df['Calories'].idxmax(), 'Item']

In [None]:
# Item with the fewest calories
df.loc[df['Calories'].idxmin(), 'Item']

In [None]:
df.sort_values('Calories', ascending=False)

In [None]:
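# numeric_only=True skips text columns such as the item name, which cannot be averaged
df.groupby('Category').mean(numeric_only=True)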
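
To compare categories on a single measure, select the column before aggregating and sort the result. This is a minimal sketch on a made-up `sample` frame; only the `Category`, `Item` and `Calories` column names are taken from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the menu table (values are invented)
sample = pd.DataFrame({
    'Category': ['Breakfast', 'Breakfast', 'Beverages', 'Beverages'],
    'Item': ['a', 'b', 'c', 'd'],
    'Calories': [300, 500, 0, 140],
})

# Mean calories per category, highest first
mean_calories = sample.groupby('Category')['Calories'].mean().sort_values(ascending=False)
print(mean_calories)
```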

In [None]:
import seaborn as sns

In [None]:
g = sns.boxplot(x='Category', y='Calories', data=df)
g.set_xticklabels(g.get_xticklabels(), rotation=45);

## Dataset 5: Shelter Animal Outcomes

https://www.kaggle.com/c/shelter-animal-outcomes

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
animals = pd.read_csv('data/shelter.csv')

In [None]:
animals.head()

In [None]:
animals.AgeuponOutcome.value_counts().plot(kind='bar', figsize=(10, 6))

In [None]:
sns.countplot(data=animals, x='AgeuponOutcome')

In [None]:
_, ax = plt.subplots(1, 1, figsize=(10, 6))
g = sns.countplot(data=animals, x='AgeuponOutcome', ax=ax)
g.set_xticklabels(g.get_xticklabels(), rotation=90);

In [None]:
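def get_age_in_days(age_upon_outcome):
    """Convert a string such as '2 years' or '3 weeks' to an approximate number of days."""
    # Missing ages are mapped to 0, so they cannot be told apart from newborns
    if pd.isna(age_upon_outcome):
        return 0
    time_value, unit = age_upon_outcome.split(' ')
    # Approximate conversion: 365 days per year, 30 days per month
    days_per_unit = {'year': 365, 'month': 30, 'week': 7, 'day': 1}
    # Strip the plural 's' so 'year' and 'years' share one lookup key
    days = days_per_unit.get(unit.rstrip('s'))
    # Unrecognized units yield None (NaN in the resulting column)
    return None if days is None else int(time_value) * days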

In [None]:
animals['AgeInDays'] = animals.AgeuponOutcome.map(get_age_in_days)
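
The same conversion can be done without a Python-level function by extracting the number and the unit with a regular expression and mapping the unit to a day count. The sketch below is an alternative under the same approximations (365 days per year, 30 days per month); the `ages` series and `DAYS_PER_UNIT` name are made up, and unlike the function above it leaves missing ages as NaN instead of 0.

```python
import pandas as pd

DAYS_PER_UNIT = {'year': 365, 'month': 30, 'week': 7, 'day': 1}

# Hypothetical stand-in for animals.AgeuponOutcome
ages = pd.Series(['1 year', '2 years', '3 weeks', '1 month', '5 days', None])

# Named groups become the column names of the extracted frame
parts = ages.str.extract(r'^(?P<value>\d+)\s+(?P<unit>year|month|week|day)s?$')

# Missing or unparseable ages propagate as NaN
age_in_days = pd.to_numeric(parts['value']) * parts['unit'].map(DAYS_PER_UNIT)
print(age_in_days)
```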

In [None]:
animals.boxplot(column=['AgeInDays'], by='OutcomeType', figsize=(10, 6))

In [None]:
f, ax = plt.subplots(1, 1, figsize=(10, 6))
sns.boxplot(data=animals, x='OutcomeType', y='AgeInDays', ax=ax)

In [None]:
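# Preview how pd.cut assigns ages to 100-day bins
pd.cut(animals.AgeInDays, list(range(0, 7000, 100))).head(5)
# Bin ages into 350-day buckets; numeric_only=True skips the text columns
age_bins = pd.cut(animals.AgeInDays, list(range(0, 7000, 350)))
animals.groupby(age_bins, observed=False).mean(numeric_only=True)
# Count the animals in each bucket and plot the counts
count_by_age = animals.groupby(age_bins, observed=False).count()
count_by_age.AgeInDays.plot(kind='bar')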

In [None]:
# distplot is deprecated in seaborn; histplot draws the same histogram (no KDE by default)
sns.histplot(animals.AgeInDays, bins=20)

Look at the distribution of animal types.

In [None]:
animals.AnimalType.value_counts().plot(kind='bar')

In [None]:
sns.countplot(data=animals, x='AnimalType')

Look at the distribution of outcome types.

In [None]:
animals['OutcomeType'].value_counts().plot(kind='bar')

In [None]:
sns.countplot(data=animals, x='OutcomeType')

Compare the distribution of animal types, broken down by outcome type.

In [None]:
_, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
animals[['AnimalType', 'OutcomeType']].groupby(['OutcomeType', 'AnimalType']).size().unstack().plot(kind='bar', ax=ax1, rot=0)
animals[['AnimalType', 'OutcomeType']].groupby(['AnimalType', 'OutcomeType']).size().unstack().plot(kind='bar', ax=ax2, rot=0)

In [None]:
_, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(data=animals, x='OutcomeType', hue='AnimalType', ax=ax1)
sns.countplot(data=animals, x='AnimalType', hue='OutcomeType', ax=ax2)

Look at the distribution of sex.

In [None]:
animals['SexuponOutcome'].value_counts().plot(kind='bar')

In [None]:
sns.countplot(data=animals, x='SexuponOutcome')

In [None]:
_, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
animals[['SexuponOutcome', 'OutcomeType']].groupby(['OutcomeType', 'SexuponOutcome']).size().unstack().plot(kind='bar', ax=ax1)
animals[['SexuponOutcome', 'OutcomeType']].groupby(['SexuponOutcome', 'OutcomeType']).size().unstack().plot(kind='bar', ax=ax2)

In [None]:
_, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
sns.countplot(data=animals, x='OutcomeType', hue='SexuponOutcome', ax=ax1)
sns.countplot(data=animals, x='SexuponOutcome', hue='OutcomeType', ax=ax2)

In [None]:
def get_sex(x):
    x = str(x)
    if 'Male' in x: return 'male'
    if 'Female' in x: return 'female'
    return 'unknown'

In [None]:
animals['Sex'] = animals.SexuponOutcome.apply(get_sex)

In [None]:
animals.Sex.value_counts().plot(kind='bar')

In [None]:
sns.countplot(x=animals.Sex)

In [None]:
def get_neutered(x):
    x = str(x)
    if 'Spayed' in x: return 'neutered'
    if 'Neutered' in x: return 'neutered'
    if 'Intact' in x: return 'intact'
    return 'unknown'

In [None]:
animals['Neutered'] = animals.SexuponOutcome.apply(get_neutered)

In [None]:
animals.Neutered.value_counts().plot(kind='bar')

In [None]:
sns.countplot(x=animals.Neutered)

In [None]:
_, (ax1, ax2) = plt.subplots(2, 2, figsize=(16, 8))
animals[['Sex', 'OutcomeType']].groupby(['OutcomeType', 'Sex']).size().unstack().plot(kind='bar', ax=ax1[0], rot=0)
animals[['Sex', 'OutcomeType']].groupby(['Sex', 'OutcomeType']).size().unstack().plot(kind='bar', ax=ax1[1], rot=0)
animals[['Neutered', 'OutcomeType']].groupby(['OutcomeType', 'Neutered']).size().unstack().plot(kind='bar', ax=ax2[0], rot=0)
animals[['Neutered', 'OutcomeType']].groupby(['Neutered', 'OutcomeType']).size().unstack().plot(kind='bar', ax=ax2[1], rot=0)

In [None]:
_, (ax1, ax2) = plt.subplots(2, 2, figsize=(16, 8))
sns.countplot(data=animals, x='OutcomeType', hue='Sex', ax=ax1[0])
sns.countplot(data=animals, x='Sex', hue='OutcomeType', ax=ax1[1])
sns.countplot(data=animals, x='OutcomeType', hue='Neutered', ax=ax2[0])
sns.countplot(data=animals, x='Neutered', hue='OutcomeType', ax=ax2[1])
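
Raw counts can be hard to compare when the groups differ in size, so it may help to look at proportions within each group. The sketch below uses `pd.crosstab` with `normalize='index'` on a made-up `sample` frame; the rows are invented and only the `Neutered` and `OutcomeType` column names come from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the animals table
sample = pd.DataFrame({
    'Neutered': ['neutered', 'neutered', 'neutered', 'neutered', 'intact', 'intact'],
    'OutcomeType': ['Adoption', 'Adoption', 'Adoption', 'Transfer', 'Transfer', 'Euthanasia'],
})

# Each row sums to 1: the share of every outcome within a neutered/intact group
share = pd.crosstab(sample['Neutered'], sample['OutcomeType'], normalize='index')
print(share)
```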