# Data Analysis with Python and Pandas Tutorial
# Part 1: Loading and Exploring Data

## Tutorial Objectives

In this tutorial, you will learn how to:

* Create Pandas dataframes directly from JSON-style data (a Python dict of columns)
* Create dataframes by loading data from CSV files, Excel files, and URLs
* Explore data using `shape`, `columns`, `info()` and `describe()`
* Explore data visually with bar plots, box plots, and line plots

## What is a DataFrame / Series?

In [None]:
### Series and Dataframes in Pandas
from IPython.display import Image
Image(url='https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png', width=600)
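
The relationship shown in the image can also be sketched in code. The following is a minimal illustration, not part of the original notebook; the variable names (`temperature`, `humidity`, `weather`) are chosen here for the example only. A Series is a single labeled column, and a DataFrame is a table of Series that share one index.

```python
import pandas as pd

# A Series is a one-dimensional labeled array: values plus an index
temperature = pd.Series(
    [83, 64, 72],
    index=["Cantho", "Danang", "Haiphong"],
    name="Temperature",
)

# A second Series that shares the same index
humidity = pd.Series([86, 65, 90], index=temperature.index, name="Humidity")

# A DataFrame is a table whose columns are Series aligned on one index
weather = pd.DataFrame({"Temperature": temperature, "Humidity": humidity})

# Selecting a single column gives back a Series
column = weather["Temperature"]
print(type(column).__name__)
print(weather.shape)
```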

## Common libraries

In [None]:
# import the Pandas library
import pandas as pd

## Creating a DataFrame (from JSON)

In [None]:
# create some data in JSON style: a Python dict mapping column names to lists of values
weather_data = {
  "City": ["Cantho", "Danang", "Haiphong", "Hanoi", "Ho Chi Minh City"],
  "Temperature": [83, 64, 72, 81, 70],
  "Humidity": [86, 65, 90, 75, 96]
}

In [None]:
# create a Pandas dataframe from the JSON data
df1 = pd.DataFrame(weather_data)

In [None]:
# output the dataframe
df1

In [None]:
# output the shape of the dataframe
df1.shape

In [None]:
# output the list of columns (Series)
df1.columns

In [None]:
# output detailed dataframe info
df1.info()

In [None]:
# describe the values in the dataframe
df1.describe()

In [None]:
# set the index to City
# output it again to see the difference (look at the City column)
df1.set_index(keys='City', inplace=True)
df1
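
To make the effect of `set_index` concrete, here is a small sketch on a reduced copy of the weather data (the names `weather`, `by_city` and `hanoi` are illustrative, not from the notebook). It assumes you want to keep the original dataframe, so it uses the non-inplace form, which returns a new dataframe. Once City is the index, rows can be looked up by city name with `.loc`.

```python
import pandas as pd

weather = pd.DataFrame({
    "City": ["Cantho", "Danang", "Hanoi"],
    "Temperature": [83, 64, 81],
    "Humidity": [86, 65, 75],
})

# Without inplace=True, set_index returns a new dataframe
# and leaves the original one unchanged
by_city = weather.set_index("City")

# City is now the row label and no longer a regular column
print(list(by_city.columns))

# Rows can be selected by label
hanoi = by_city.loc["Hanoi"]
print(hanoi["Temperature"])
```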

In [None]:
# plot the dataframe, as a 'bar' or 'barh'
df1.plot(kind='bar', figsize=(6, 3), title='Vietnam Major Cities Weather')

In [None]:
# save this dataset to disk in CSV format
df1.to_csv('simple_weather.csv')
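
Because `to_csv` writes the index as the first column, reading the file back needs `index_col` to restore it. The sketch below is an illustration under assumptions: it writes to a temporary directory rather than the working directory, and uses a two-row stand-in for `df1`.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for df1, with City already set as the index
weather = pd.DataFrame(
    {"Temperature": [83, 64], "Humidity": [86, 65]},
    index=pd.Index(["Cantho", "Danang"], name="City"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "simple_weather.csv")
    # The index (City) is written as the first CSV column
    weather.to_csv(path)
    # index_col tells read_csv to use that column as the index again
    restored = pd.read_csv(path, index_col="City")

print(restored)
```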

## Reading Excel data into a DataFrame

In [None]:
# read an Excel file into a DataFrame (7dtd-weapons.xlsx, sheet name = "All")
# this should be available at: https://1drv.ms/x/s!AgtH78k0_cuvgkgBijDifi1YkIck
df2 = pd.read_excel('7dtd-weapons.xlsx', sheet_name='All')

In [None]:
# output the dataframe, but just a few (5) rows
df2.head()

In [None]:
# output the shape of the dataframe
df2.shape

In [None]:
# output dataframe info
df2.info()

In [None]:
# potential issue: "Type" is a string column - probably supposed to be categorical
# convert column "Type" to a categorical series (an enumeration)
# output info again to see the difference
df2.Type = df2.Type.astype('category')
df2.info()
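
What the categorical conversion does can be seen on a tiny example. The rows below are made up for illustration and are not taken from the Excel file. A categorical column stores each distinct value once and keeps an integer code per row.

```python
import pandas as pd

# Hypothetical stand-in for df2; the values are invented for this sketch
weapons = pd.DataFrame({
    "Firearm": ["Weapon A", "Weapon B", "Weapon C", "Weapon D"],
    "Type": ["Handgun", "Rifle", "Shotgun", "Handgun"],
})

weapons["Type"] = weapons["Type"].astype("category")

# The dtype is now 'category'
print(weapons["Type"].dtype)

# Distinct values are stored once, in sorted order
categories = list(weapons["Type"].cat.categories)
print(categories)

# Each row holds an integer code pointing into the categories
codes = weapons["Type"].cat.codes.tolist()
print(codes)
```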

In [None]:
# potential issue: column names have spaces in them
# rename column "Magazine Size" to "MagSize"
# output info again to see the difference
df2.rename(columns={'Magazine Size': 'MagSize'}, inplace=True)
df2.info()
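
One reason to remove spaces is that a column can then be reached with attribute access (`df.MagSize`). The sketch below uses invented values and also shows one possible way to strip spaces from every column name at once by passing a function to `rename`; this is an optional variation, not something the original notebook does.

```python
import pandas as pd

# Hypothetical stand-in for df2; the values are invented for this sketch
weapons = pd.DataFrame({
    "Firearm": ["Weapon A", "Weapon B"],
    "Magazine Size": [15, 8],
})

# Rename a single column using a mapping of old name to new name
renamed = weapons.rename(columns={"Magazine Size": "MagSize"})

# Attribute access works once the name is a valid Python identifier
print(renamed.MagSize.tolist())

# Alternative: apply a function to every column name
compact = weapons.rename(columns=lambda name: name.replace(" ", ""))
print(list(compact.columns))
```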

In [None]:
# change the index to column 'Firearm', inplace
# output the head again to see the difference
#df2.reset_index(inplace=True)  # reset_index restores the original numeric index
df2.set_index(keys='Firearm', inplace=True)
df2.head()

In [None]:
# plot the head of the dataframe, with kind "barh"
df2.head().plot(kind='barh', figsize=(12, 4), title='Firearms Overview')

In [None]:
# reset the index again
df2.reset_index(inplace=True)

## Reading CSV data (from a URL) into a DataFrame

In [None]:
url = 'https://raw.githubusercontent.com/pandas-dev/pandas/main/pandas/tests/io/data/csv/tips.csv'

In [None]:
# read a CSV file into DataFrame df3
df3 = pd.read_csv(url)

In [None]:
# output dataframe info
df3.info()

In [None]:
# output a few rows (10)
df3.head(10)

In [None]:
# take columns total_bill and tip and plot boxplot
#df3.plot(kind='box', figsize=(5, 10), subplots=False, by='sex')
df3.boxplot(column=['total_bill', 'tip'], by='sex', figsize=(8, 8))
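
A box plot per group has a numeric counterpart in `groupby`. The sketch below uses four invented rows with the same column names as the tips dataset (`total_bill`, `tip`, `sex`) and computes the per-group median, which is the line drawn inside each box.

```python
import pandas as pd

# Invented rows that reuse the tips column names
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 5.0, 7.0],
    "sex": ["Female", "Male", "Female", "Male"],
})

# Median of each numeric column within each group
medians = tips.groupby("sex")[["total_bill", "tip"]].median()
print(medians)
```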

## Reading CSV data with a time series

In [None]:
# URL of the monthly gold prices dataset
gold_url = 'https://datahub.io/core/gold-prices/r/monthly.csv'

In [None]:
df4 = pd.read_csv(gold_url)
df4.shape

In [None]:
df4.head()

In [None]:
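# convert the Date column from strings to datetime values
# (astype('datetime64') without a unit fails on recent pandas versions)
df4['Date'] = pd.to_datetime(df4['Date'])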

In [None]:
#df4.reset_index(inplace=True)  # reset the index back to the original numeric index
df4.set_index('Date', inplace=True)

In [None]:
# output dataframe info, to verify the index is a DatetimeIndex
df4.info()
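
A datetime index is useful beyond plotting. The sketch below uses four invented monthly prices (not the real gold data) to show two things a `DatetimeIndex` allows: selecting rows with a partial date string, and grouping by a date component such as the year.

```python
import pandas as pd

# Invented monthly prices, for illustration only
prices = pd.DataFrame({
    "Date": ["2019-11-01", "2019-12-01", "2020-01-01", "2020-02-01"],
    "Price": [100.0, 110.0, 120.0, 130.0],
})

# Parse the strings and use the dates as the index
prices["Date"] = pd.to_datetime(prices["Date"])
prices = prices.set_index("Date")
print(type(prices.index).__name__)

# Partial string indexing: all rows that fall in 2020
year_2020 = prices.loc["2020"]
print(len(year_2020))

# Average price per calendar year
yearly_mean = prices.groupby(prices.index.year)["Price"].mean()
print(yearly_mean)
```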

In [None]:
# plot the dataset
df4.plot(figsize=(12,4), title='Gold Prices')