# Python Programming

**Chapter 6 : Data Handling with Python** 

Python is a fun language to learn, and easy to pick up even if you are new to programming. In fact, quite often, Python is easier to pick up if you do not have any programming experience whatsoever. Python is a high-level programming language, suited to students and professionals from diverse backgrounds.

In this chapter, we will cover
- Exploring CSV Dataset
- Exploring JSON Files
- Exploring HTML Tables

**License Declaration** : Following the lead from the inspirations for this material, and the *spirit* of Python education and development, all modules of this work are licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/.

---

## Exploring CSV Dataset

Python is a popular language for data analysis; the libraries most commonly used are `NumPy`, `pandas`, `Matplotlib`, `Seaborn` and `scikit-learn`. In this example, we use the **"Pokemon with stats"** dataset from Kaggle, curated by *Alberto Barradas* (source: https://www.kaggle.com/abcsds/pokemon).

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

### Import CSV file into a DataFrame

CSV is the easiest file format to handle in Pandas. Just use the `read_csv` function to import a CSV file into a Pandas `DataFrame`.

In [None]:
# Read the CSV Data
pkmndata = pd.read_csv('files/pokemonData.csv')

In [None]:
# Check the first few rows of the dataset
pkmndata.head()
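
To try `read_csv` without downloading the Kaggle file, the same call can be pointed at an in-memory text buffer. This is a minimal sketch: the three rows and the name `sample_csv` are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the Pokemon CSV
sample_csv = io.StringIO(
    "#,Name,Type 1,Type 2,HP\n"
    "1,Alpha,Grass,Poison,45\n"
    "2,Beta,Fire,,39\n"
    "3,Gamma,Water,,44\n"
)

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(sample_csv)

print(sample.shape)             # (rows, columns)
print(sample["Type 2"].isna())  # empty fields are read as NaN
```

Note how the empty `Type 2` fields come back as `NaN`, which is what the dataset description below refers to.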

Description of the dataset, as available on Kaggle, is as follows.
Learn more : https://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon

> **\#** : ID for each Pokemon (runs from 1 to 721)  
> **Name** : Name of each Pokemon  
> **Type 1** : Each Pokemon has a basic Type, this determines weakness/resistance to attacks  
> **Type 2** : Some Pokemon are dual type and have a Type 2 value (set to `NaN` otherwise)  
> **Total** : Sum of all stats of a Pokemon, a general guide to how strong a Pokemon is  
> **HP** : Hit Points, defines how much damage a Pokemon can withstand before fainting  
> **Attack** : The base modifier for normal attacks by the Pokemon (e.g., scratch, punch etc.)  
> **Defense** : The base damage resistance of the Pokemon against normal attacks  
> **Sp. Atk** : Special Attack, the base modifier for special attacks (e.g., fire blast, bubble beam)  
> **Sp. Def** : Special Defense, the base damage resistance against special attacks  
> **Speed** : Determines which Pokemon attacks first each round  
> **Generation** : Each Pokemon belongs to a certain Generation  
> **Legendary** : Legendary Pokemon are powerful, rare, and hard to catch

In [None]:
# Check the Data Type
print("Data type : ", type(pkmndata))
print("Data dims : ", pkmndata.shape)
print()
print(pkmndata.dtypes)
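
Since `Type 2` is `NaN` for single-type entries, it is worth checking how many values are missing before computing any statistics. A small sketch on a made-up frame (`toy` is a hypothetical stand-in for `pkmndata`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pkmndata with a partly missing "Type 2" column
toy = pd.DataFrame({
    "Name": ["Alpha", "Beta", "Gamma", "Delta"],
    "Type 1": ["Grass", "Fire", "Water", "Fire"],
    "Type 2": ["Poison", np.nan, np.nan, "Flying"],
})

# Number of missing entries in each column
missing = toy.isna().sum()
print(missing)

# Fraction of rows without a secondary type
print(toy["Type 2"].isna().mean())
```

The same two lines should work unchanged on `pkmndata`.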

### Statistics on Numeric Variables 

You may want to check the basic statistical descriptions and visualize the corresponding statistical plots for *Numeric Variables* as follows.

In [None]:
# Extract only the numeric data variables
numDF = pd.DataFrame(pkmndata[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]])

# Summary Statistics for all Variables
numDF.describe()
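
Instead of listing the columns by hand, the numeric columns can also be picked by data type with `select_dtypes`. The sketch below uses a made-up frame; on the real dataset this would likely also pick up `#`, `Total` and `Generation`, which is one reason to prefer the explicit list above.

```python
import pandas as pd

# Hypothetical frame mixing text, numeric, and boolean columns
toy = pd.DataFrame({
    "Name": ["Alpha", "Beta", "Gamma"],
    "HP": [45, 39, 44],
    "Attack": [49, 52, 48],
    "Legendary": [False, False, True],
})

# Keep only the numeric columns; boolean columns are not counted as numbers
num_only = toy.select_dtypes(include="number")
print(list(num_only.columns))
print(num_only.describe())
```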

In [None]:
# Visualize the distributions of all variables
f, axes = plt.subplots(6, 3, figsize=(18, 24))
color_list = ["b", "g", "r", "c", "m", "y"]

# One row per variable: box plot, histogram with KDE, violin plot
for count, var in enumerate(numDF):
    sb.boxplot(x = numDF[var], ax = axes[count,0], color = color_list[count])
    # histplot replaces the deprecated distplot (needs Seaborn 0.11 or later)
    sb.histplot(x = numDF[var], kde = True, ax = axes[count,1], color = color_list[count])
    sb.violinplot(x = numDF[var], ax = axes[count,2], color = color_list[count])

In [None]:
# Correlation Matrix
print(numDF.corr())

# Heatmap of the Correlation Matrix
f, axes = plt.subplots(1, 1, figsize=(18, 12))
sb.heatmap(numDF.corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "RdBu")
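
A heatmap is good for an overview, but the strongest relationship can also be pulled out of the correlation matrix directly. A minimal sketch on made-up numbers (the frame `toy` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical scores: "b" tends to rise with "a", "c" falls as "a" rises
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 4, 6],
    "c": [10, 8, 6, 4, 2],
})

corr = toy.corr()  # Pearson correlation by default

# Flatten the matrix into a Series indexed by variable pairs,
# then drop the self-correlations on the diagonal
pairs = corr.unstack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first != second]

# Pair with the largest absolute correlation
strongest = pairs.abs().idxmax()
print(strongest, round(pairs.loc[strongest], 2))
```

Taking the absolute value first means a strong negative correlation counts as much as a strong positive one.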

In [None]:
# Draw pairs of variables against one another
sb.pairplot(data = numDF)

### Statistics on Categorical Variables

You may also want to check the basic statistical descriptions and visualize the corresponding statistical plots for the *Categorical Variables*.

In [None]:
# Generations in the Dataset
print("Number of Generations :", len(pkmndata["Generation"].unique()))

# Pokemons in each Generation
print(pkmndata["Generation"].value_counts())
sb.catplot(y = "Generation", data = pkmndata, kind = "count")

In [None]:
# Primary Types in the Dataset
print("Number of Primary Types :", len(pkmndata["Type 1"].unique()))

# Pokemons of each Primary Type
print(pkmndata["Type 1"].value_counts())
sb.catplot(y = "Type 1", data = pkmndata, kind = "count", height = 8)

In [None]:
# Secondary Types in the Dataset
print("Number of Secondary Types :", len(pkmndata["Type 2"].dropna().unique()))

# Pokemons of each Secondary Type
print(pkmndata["Type 2"].dropna().value_counts())
sb.catplot(y = "Type 2", data = pkmndata, kind = "count", height = 8)
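
By default `value_counts` ignores missing values, so the explicit `dropna()` above is optional. To see how many entries have no secondary type at all, missing values can be counted as their own category. A sketch on a made-up column:

```python
import numpy as np
import pandas as pd

# Hypothetical secondary types; NaN marks a single-type entry
type2 = pd.Series(["Poison", np.nan, "Flying", np.nan, "Poison", np.nan], name="Type 2")

# Default behaviour: NaN is left out of the counts
print(type2.value_counts())

# Count NaN as its own category
with_missing = type2.value_counts(dropna=False)
print(with_missing)

# Proportions instead of raw counts
print(type2.value_counts(normalize=True))
```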

#### Quick Tasks

- Do a similar exploration on the Kaggle Housing Prices dataset : https://www.kaggle.com/c/house-prices-advanced-regression-techniques

---

## Exploring JSON Files

If the dataset is in a standard JSON format, we may use the `read_json` function from Pandas. JSON is also quite a common data format in practice.

In [None]:
# Importing JSON files
cuisdata = pd.read_json('files/cuisineData.json')
cuisdata.head()
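
To see what `read_json` does with a list of records, it can be tried on a small in-memory sample. This is a sketch only: the two recipes and the name `sample_json` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical two-recipe sample laid out like the cuisine file
sample_json = io.StringIO(
    '[{"id": 1, "cuisine": "indian", "ingredients": ["red lentils", "onions"]},'
    ' {"id": 2, "cuisine": "italian", "ingredients": ["tomatoes", "basil", "garlic"]}]'
)

# A JSON array of records becomes one row per record
sample = pd.read_json(sample_json)

print(sample.shape)
print(type(sample.loc[0, "ingredients"]))  # nested arrays stay as Python lists
```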

Description of the dataset, as available on Kaggle, is as follows.
Learn more : https://www.kaggle.com/c/whats-cooking

> **id** : ID for each Recipe     
> **cuisine** : Type of Cuisine      
> **ingredients** : the list of ingredients of each recipe (of variable length)     

One such example is as follows. I have no clue what the dish is! Let me know if you can guess. ;-)

    {
        "id": 24717,
        "cuisine": "indian",
        "ingredients": [
                        "tumeric",
                        "vegetable stock",
                        "tomatoes",
                        "garam masala",
                        "naan",
                        "red lentils",
                        "red chili peppers",
                        "onions",
                        "spinach",
                        "sweet potatoes"
                       ]
    }

In [None]:
# Check the Data Type
print("Data type : ", type(cuisdata))
print("Data dims : ", cuisdata.shape)

In [None]:
# Extract a single row
cuisdata.iloc[0]

In [None]:
# Extract a single column
cuisdata["ingredients"]

In [None]:
# Extract a single element
cuisdata.loc[0, "ingredients"]
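
Because each `ingredients` entry is a Python list, a couple of list-aware Pandas methods are handy here. The sketch below uses a made-up frame (`toy` is hypothetical) to count the ingredients per recipe and find the most common ones.

```python
import pandas as pd

# Hypothetical recipes with list-valued ingredients
toy = pd.DataFrame({
    "cuisine": ["indian", "italian", "indian"],
    "ingredients": [["onions", "spinach"], ["tomatoes", "onions", "basil"], ["onions"]],
})

# Number of ingredients per recipe (.str.len() also works on lists)
toy["n_ingredients"] = toy["ingredients"].str.len()

# One row per (recipe, ingredient) pair, then count each ingredient
counts = toy["ingredients"].explode().value_counts()

print(toy[["cuisine", "n_ingredients"]])
print(counts.head())
```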

#### Quick Tasks

- Create a new column `ingredient_string` by joining all ingredients of each recipe together into a single string.    

---

## Exploring HTML Tables

In [None]:
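# Importing HTML tables from the URL
# read_html returns a list of DataFrames, one per table found on the page
# (it needs an HTML parser such as lxml, or bs4 with html5lib, to be installed)
html_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_actors_with_two_or_more_Academy_Awards_in_acting_categories')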

In [None]:
# Check the dataset you imported
print("Data type : ", type(html_data))
print("HTML tables : ", len(html_data))

In [None]:
# Check the individual tables
html_data[0]
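
A page may hold several tables, and their order can change when the page is edited, so picking one by index is fragile. One possible alternative is to search the list for the table that has the expected columns. The sketch below uses a made-up list (`tables` is a hypothetical stand-in for `html_data`) and assumes the column names used later in this section.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by read_html
tables = [
    pd.DataFrame({"Note": ["legend"]}),
    pd.DataFrame({"Actor": ["A", "B"], "Total awards": [3, 2], "Total nominations": [8, 5]}),
]

wanted = {"Total awards", "Total nominations"}

# Keep the first table whose header contains every wanted column
match = next(t for t in tables if wanted.issubset(map(str, t.columns)))
print(match.head())
```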

In [None]:
# Save the table as a Pandas DataFrame
awardsDF = pd.DataFrame(html_data[0])
awardsDF.head()

In [None]:
# Check the statistics for Total awards
awardsDF['Total awards'].describe()

In [None]:
# Check the statistics for Total nominations
awardsDF['Total nominations'].describe()
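
Scraped tables sometimes store numbers as text, for instance when a cell carries a footnote marker, and then `describe()` reports counts and unique values instead of means. If that happens, the column can be cleaned and converted first. A sketch on a made-up column (the values and the `[a]` marker are hypothetical):

```python
import pandas as pd

# Hypothetical scraped column: numbers stored as text, one with a footnote marker
raw = pd.Series(["4", "3", "2[a]", "3"], name="Total awards")

# Strip bracketed footnotes, then convert; anything unparseable becomes NaN
clean = pd.to_numeric(raw.str.replace(r"\[.*?\]", "", regex=True), errors="coerce")
print(clean.describe())
```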

In [None]:
# Check the relationship between nominations and awards
sb.jointplot(x = awardsDF['Total nominations'], y = awardsDF['Total awards'])

#### Quick Tasks

- Do a similar analysis on the 2016 Summer Olympics medal table : https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table