# Data Visualization (2017/18)

## Solutions for Assignment 2 - Exploratory Data Analysis

Presented by Group 16: 
-  Akhil Singh Rana
-  Reddy Kumar Modam Venkataramana

Date: 14.11.2017

### Exercise 1: Choose a data set

We will use the Python Data Analysis Library (Pandas) to work with the data. Pandas provides a rich feature set for working with tabular data including data structures and analysis tools.

A 10-minute introduction to pandas covering all necessary functions can be found here: http://pandas.pydata.org/pandas-docs/stable/10min.html (Skim the available functionality; you don't have to learn it yet.)

In [2]:
# import the pandas library and call it pd for further usage
import pandas as pd

#### Sample code to load the baseball data and get basic statistics

In [3]:
# read the csv file into a pandas dataframe and print the first lines of the table
filename = "baseball_data.csv"
df = pd.read_csv( filename, header=0 )
df.head()

Unnamed: 0,name,handedness,height,weight,avg,HR
0,Tom Brown,R,73,170,0.0,0
1,Denny Lemaster,R,73,182,0.13,4
2,Joe Nolan,L,71,175,0.263,27
3,Denny Doyle,L,69,175,0.25,16
4,Jose Cardenal,R,70,150,0.275,138


In [4]:
df.describe(include="all")

Unnamed: 0,name,handedness,height,weight,avg,HR
count,1157,1157,1157.0,1157.0,1157.0,1157.0
unique,1151,3,,,,
top,Bobby Mitchell,R,,,,
freq,2,737,,,,
mean,,,72.756266,184.513397,0.186793,45.359551
std,,,2.142272,15.445995,0.106175,74.06511
min,,,65.0,140.0,0.0,0.0
25%,,,71.0,175.0,0.138,1.0
50%,,,73.0,185.0,0.238,15.0
75%,,,74.0,195.0,0.258,55.0


#### Sample code to load the Titanic data and get basic statistics

In [5]:
# read the csv file into a pandas dataframe and print the first lines of the table
filename = "titanic3.csv"
df = pd.read_csv( filename, header=0 )
df.head()


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [6]:
df.dtypes

pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

In [7]:
df.describe(include="all")

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
count,1309.0,1309.0,1309,1309,1046.0,1309.0,1309.0,1309,1308.0,295,1307,486.0,121.0,745
unique,,,1307,2,,,,929,,186,3,27.0,,369
top,,,"Connolly, Miss. Kate",male,,,,CA. 2343,,C23 C25 C27,S,13.0,,"New York, NY"
freq,,,2,843,,,,11,,6,914,39.0,,64
mean,2.294882,0.381971,,,29.881138,0.498854,0.385027,,33.295479,,,,160.809917,
std,0.837836,0.486055,,,14.413493,1.041658,0.86556,,51.758668,,,,97.696922,
min,1.0,0.0,,,0.17,0.0,0.0,,0.0,,,,1.0,
25%,2.0,0.0,,,21.0,0.0,0.0,,7.8958,,,,72.0,
50%,3.0,0.0,,,28.0,0.0,0.0,,14.4542,,,,155.0,
75%,3.0,1.0,,,39.0,1.0,0.0,,31.275,,,,256.0,
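
The `count` row above shows that several columns are incomplete: `age` has 1046 of 1309 entries, `cabin` 295 and `body` 121. A minimal sketch of how such gaps could be counted per column follows. It uses a small hand-made frame (the values in `toy` are invented for illustration, not taken from `titanic3.csv`) so that it runs without the data file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic table; values are invented for illustration only
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "age": [29.0, np.nan, 30.0, np.nan, 25.0],
    "cabin": ["B5", None, None, None, None],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of missing values per column (mean of booleans = fraction of True values)
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the real dataframe, the same two calls (`df.isna().sum()` and `df.isna().mean()`) should reproduce the gaps visible in the `count` row.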


#### Your selection:
Selected dataset: Titanic Data Set

### Exercise 2: Data queries

1. How many passengers were in each passenger class?
2. How many people survived, and how many did not, in each passenger class?

### Exercise 3: Data Analysis

1. How many passengers were in each passenger class?

In [8]:
filename = "titanic3.csv"
df = pd.read_csv( filename, header=0 )

df_pclass = df['pclass'].value_counts(ascending=False).reset_index()
# set column names
df_pclass.columns = ['pclass', 'count']
df_pclass.head()

Unnamed: 0,pclass,count
0,3,709
1,1,323
2,2,277
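
Relative shares can make the class sizes easier to compare than raw counts. The sketch below assumes only the three counts printed above (709, 323, 277) and derives each class's share of all passengers. On the real column, `df['pclass'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts per passenger class, copied from the output above
counts = pd.Series({3: 709, 1: 323, 2: 277}, name="count")
counts.index.name = "pclass"

# Divide by the total to get each class's share of all passengers
shares = counts / counts.sum()

# Sort by class number so the rows read 1, 2, 3
summary = pd.DataFrame({"count": counts, "share": shares.round(3)}).sort_index()
print(summary)
```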


In [9]:
# import bokeh 
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.models import HoverTool

# tell bokeh to show the figures in the notebook
from bokeh.io import output_notebook
output_notebook()

In [10]:
# Create input data for bokeh from the dataframe
source = ColumnDataSource( df_pclass )

In [11]:
# hover tool that shows the class and its passenger count
hover = HoverTool(
    tooltips=[("Passenger Class", '@pclass'),
              ("Total Count", '@count')])

# vertical bar chart: one bar per passenger class, bar height = passenger count
p = figure( title="Passenger Count in Each Class", plot_width=500, x_axis_label='Class',
           y_axis_label='Count of Passengers', tools=[hover],
           x_axis_type='linear')
p.vbar( source=source, x='pclass', width=1.0, bottom=0, top='count')
show(p)

2. How many people survived, and how many did not, in each passenger class?

In [12]:
filename = "titanic3.csv"
df = pd.read_csv( filename, header=0 )
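# sum only the 'survived' flag (1 = survived) per class, so that
# non-numeric columns such as 'name' are not aggregated
survived_sum = df.groupby('pclass', as_index=False)['survived'].sum()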

In [13]:
# import bokeh 
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.models import HoverTool

# tell bokeh to show the figures in the notebook
from bokeh.io import output_notebook
output_notebook()

In [14]:
# Create input data for bokeh from the dataframe
source = ColumnDataSource( survived_sum )

In [15]:
# hover tool that shows the class and its number of survivors
hover = HoverTool(
    tooltips=[("Passenger Class", '@pclass'),
              ("Survivors", '@survived')])

# vertical bar chart: one bar per passenger class, bar height = number of survivors
p = figure( title="Passengers Survived in Each Class", plot_width=500, x_axis_label='Class',
           y_axis_label='Number of Survivors', tools=[hover],
           x_axis_type='linear')
p.vbar( source=source, x='pclass', width=1.0, bottom=0, top='survived')
show(p)
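
The chart above covers only the survivors, while the query also asks for those who did not survive. One possible way to get both counts per class is `pd.crosstab`, sketched here on a small invented frame (`toy` is a hypothetical stand-in, not the real `titanic3.csv`). The column names `not_survived` and `survival_rate` are chosen for this illustration only.

```python
import pandas as pd

# Toy stand-in for titanic3.csv; rows are invented for illustration only
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Cross-tabulate class against the survival flag (0 = did not survive, 1 = survived)
outcome = pd.crosstab(toy["pclass"], toy["survived"])
outcome.columns = ["not_survived", "survived"]

# Survival rate per class: the mean of a 0/1 column is the fraction of ones
outcome["survival_rate"] = toy.groupby("pclass")["survived"].mean()
print(outcome)
```

With the real data, replacing `toy` by `df` should yield one row per class. The resulting frame could then be wrapped in a `ColumnDataSource`, as in the cells above, to plot both counts.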