 ## General Assembly - 30th May 2017 ##

This notebook contains exercises for getting started with data analysis and visualisation in Python. The three main topics we will cover in this class are:
1. Reading in data from different sources
2. Manipulating data in Python
3. Visualising data in Python

### Reading in data from different sources
1. Reading in from a URL
2. Reading in from an Excel spreadsheet
3. Reading in from a CSV file
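
All three sources end up in the same place: a pandas DataFrame. As a minimal sketch of the CSV case that runs without a network connection or a file on disk, the example below reads a small in-memory sample. The three rows and the `sample_csv` name are made up for illustration; only the column names come from this notebook.

```python
import io
import pandas as pd

# Hypothetical three-row sample, laid out like a headerless Iris CSV
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None says the first line is data, not column labels;
# names= supplies the labels instead
sample = pd.read_csv(sample_csv, header=None, names=columns)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # numeric columns are parsed as floats
```

`pd.read_csv` accepts a file path, a URL, or any file-like object, which is why the same call works for the URL and local-file cells that follow.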

In [None]:
# Load the Iris dataset from CSV URL
# 1. Import the required libraries
import numpy as np
import pandas as pd
import urllib.request  # the request submodule must be imported explicitly

In [None]:
# 2. Specify the URL for the Iris dataset (UCI Machine Learning Repository)
url = "http://goo.gl/HppjFh"

# 3. Download the file
raw_data = urllib.request.urlopen(url)

# 4. Load the CSV file into a pandas DataFrame
#dataset = pd.read_csv(raw_data, delimiter=",")
dataset = pd.read_csv(raw_data, delimiter=",", names=('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'))
#print(dataset.shape)
dataset.head()

# Refer to http://pandas.pydata.org/pandas-docs/version/0.15.0/io.html#io-read-csv-table

In [None]:
# Read data from an excel spreadsheet
# 1. Load the file into python
xl = pd.ExcelFile("iris.xlsx")
# 2. Find what sheets are in the workbook
xl.sheet_names

In [None]:
# 3. Read in the dataset from the 'iris' sheet
df = xl.parse("iris")
df.head()
df.shape

In [None]:
# Bonus: To write the file to excel format we can use the 'to_excel' method
df.to_excel('iris_saved_v2.xlsx', sheet_name='Sheet1')

In [None]:
# Read data from a csv
iris_data = pd.read_csv('iris.csv')
iris_data.head()

In [None]:
# To write the file to CSV format we can use the 'to_csv' method
df.to_csv('iris_saved.csv')
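
By default `to_csv` also writes the DataFrame index as an extra first column, which then reappears as an unnamed column when the file is read back. The sketch below shows the difference using in-memory buffers instead of files; the two-row `sample` frame is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the Iris data
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "species": ["setosa", "versicolor"],
})

# Default behaviour: the index is written as the first column
with_index = io.StringIO()
sample.to_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
sample.to_csv(without_index, index=False)

# Rewind both buffers and read them back
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Passing `index=False` is usually what you want when the index is just the default row counter.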

### Manipulating Data in Python
In this section we will begin to summarise the data to get an idea of its distribution and of what cleaning it requires. This is an essential step in any data science project.
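
As a rough sketch of what "checking what cleaning is required" can look like, the example below counts missing values and category frequencies on a small made-up frame. The data, including the missing measurement, is hypothetical; the Iris data used in this notebook may not contain any gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing measurement
sample = pd.DataFrame({
    "petal_length": [1.4, np.nan, 6.0, 4.7],
    "species": ["setosa", "setosa", "virginica", "versicolor"],
})

# Number of missing values in each column
missing = sample.isna().sum()
print(missing)

# Number of rows in each category
species_counts = sample["species"].value_counts()
print(species_counts)

# One simple cleaning strategy: drop rows that contain any missing value
cleaned = sample.dropna()
print(len(cleaned))
```

Dropping rows is only one option; whether it is appropriate depends on how much data is missing and why.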

In [None]:
# Get a count of the number of rows in the DataFrame
len(iris_data.index)

In [None]:
# Get the dimensions of the DataFrame
iris_data.shape

In [None]:
# Summarise the data
iris_data.describe()

In [None]:
# Select only the observations with petal_length < 1.7
iris_data[(iris_data['petal_length']<1.7)]
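
The filter above uses a single condition. A sketch of combining two conditions follows, on a small invented frame: each condition needs its own parentheses, and `&` (not `and`) combines them element-wise. `DataFrame.query` is shown as an alternative spelling of the same filter.

```python
import pandas as pd

# Hypothetical four-row sample
sample = pd.DataFrame({
    "petal_length": [1.4, 1.9, 4.7, 6.0],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Build a boolean mask; & is the element-wise "and" for pandas Series
mask = (sample["petal_length"] < 1.7) & (sample["species"] == "setosa")
selected = sample[mask]

# The same filter written as a query string
same = sample.query("petal_length < 1.7 and species == 'setosa'")

print(selected)
```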

In [None]:
# Now let's group the data by the species
byspecies = iris_data.groupby('species')
byspecies.describe()

In [None]:
# Apply a function by a group (Species)
# You can try mean, max, median, etc
byspecies['petal_length'].max()

In [None]:
# We can also aggregate by group (makes little sense in this context but this will come in handy)
byspecies['petal_length'].aggregate(np.sum)

In [None]:
# We can also apply several aggregation functions at once
byspecies['petal_length'].agg([len, np.mean, np.std])
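
A possible variation on the cell above: pandas also accepts aggregation functions as string names, and keyword arguments let you name the output columns. The sketch below assumes a reasonably recent pandas (named aggregation on a grouped column was added in 0.25) and uses invented measurements.

```python
import pandas as pd

# Hypothetical sample: two rows per species
sample = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 5.0, 6.0],
})

# keyword = output column name, value = name of the aggregation
summary = sample.groupby("species")["petal_length"].agg(
    n="count",
    mean="mean",
    std="std",
)

print(summary)
```

The result has one row per species and one column per keyword, which tends to be easier to read than columns named after the functions themselves.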

### Visualising data in Python
This section deals with visualising data in Python. We will cover different graph types and how to interpret them.
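
To help with interpreting the first graph type, here is a sketch of what a histogram computes before anything is drawn: the value range is split into equal-width bins and the values falling in each bin are counted. The six `values` are made up, and `np.histogram` stands in for the counting step of a plotted histogram.

```python
import numpy as np

# Hypothetical sepal widths
values = np.array([2.0, 2.5, 3.0, 3.0, 3.5, 4.0])

# Split the range [min, max] into 4 equal-width bins and count values per bin
counts, edges = np.histogram(values, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

The bar heights in a histogram are these counts, so changing the number of bins changes the picture without changing the data.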

In [None]:
# Display the plots in the notebook with the following command
%matplotlib inline
# Import the graphing libraries we will use
import matplotlib.pyplot as plt

In [None]:
iris_data[['sepal_width']].hist()

In [None]:
# Install another Python library using pip
! pip install seaborn

In [None]:
import seaborn as sns
sns.set(color_codes=True)

In [None]:
iris_data.loc[iris_data['species'] == 'setosa', 'sepal_width'].hist()

In [None]:
# Draw a Scatterplot showing sepal width and length
sns.lmplot(x='sepal_width', y="sepal_length", hue="species", data=iris_data, fit_reg=True)
#iris_data.plot(kind='scatter', x='sepal_width', y='sepal_length');

In [None]:
sns.jointplot(x="sepal_width", y="sepal_length", data=iris_data);

In [None]:
sns.pairplot(iris_data);

In [None]:
sns.pairplot(iris_data, hue='species');

In [None]:
# parallel_coordinates lives in pandas.plotting (pandas.tools.plotting was removed)
from pandas.plotting import parallel_coordinates
parallel_coordinates(iris_data, 'species')

### Plotly
A nice open-source library for interactive visualisations.

In [None]:
# To run any command at the system shell, simply prefix it with !
# pip won't work from inside python without it
!pip install plotly --upgrade

In [None]:
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

print(__version__) # requires version >= 1.9.0

In [None]:
import plotly
from plotly.graph_objs import Scatter, Layout

plotly.offline.init_notebook_mode(connected=True)

plotly.offline.iplot({
    "data": [Scatter(x=[1, 2, 3, 4], y=[4, 3, 2, 1])],
    "layout": Layout(title="hello world")
})

In [None]:
import plotly as py
import plotly.figure_factory as ff
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/school_earnings.csv")

table = ff.create_table(df)
plotly.offline.iplot(table, filename='jupyter/table1')

In [None]:
# plotly.plotly (the online API) is not needed for offline plotting
from plotly.graph_objs import *

data = [Bar(x=df.School,
            y=df.Gap)]

plotly.offline.iplot(data, filename='jupyter/basic_bar')

In [None]:

# plotly.plotly (the online API) is not needed for offline plotting
from plotly.graph_objs import *

trace_women = Bar(x=df.School,
                  y=df.Women,
                  name='Women',
                  marker=dict(color='#ffcdd2'))

trace_men = Bar(x=df.School,
                y=df.Men,
                name='Men',
                marker=dict(color='#A2D5F2'))

trace_gap = Bar(x=df.School,
                y=df.Gap,
                name='Gap',
                marker=dict(color='#59606D'))

data = [trace_women, trace_men, trace_gap]
layout = Layout(title="Average Earnings for Graduates",
                xaxis=dict(title='School'),
                yaxis=dict(title='Salary (in thousands)'))
fig = Figure(data=data, layout=layout)

plotly.offline.iplot(fig, filename='jupyter/styled_bar')

In [None]:
# Scatter plot with heatmap
x = np.random.randn(2000)
y = np.random.randn(2000)
# Plain dicts replace the deprecated Contours and Marker helper classes
plotly.offline.iplot([Histogram2dContour(x=x, y=y, contours=dict(coloring='heatmap')),
       Scatter(x=x, y=y, mode='markers', marker=dict(color='white', size=3, opacity=0.3))], show_link=False)

In [None]:
# Mapping
df_airports = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_us_airport_traffic.csv')
df_airports.head()

df_flight_paths = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_aa_flight_paths.csv')
df_flight_paths.head()

airports = [ dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_airports['long'],
        lat = df_airports['lat'],
        hoverinfo = 'text',
        text = df_airports['airport'],
        mode = 'markers',
        marker = dict(
            size=2,
            color='rgb(255, 0, 0)',
            line = dict(
                width=3,
                color='rgba(68, 68, 68, 0)'
            )
        ))]

flight_paths = []
for i in range( len( df_flight_paths ) ):
    flight_paths.append(
        dict(
            type = 'scattergeo',
            locationmode = 'USA-states',
            lon = [ df_flight_paths['start_lon'][i], df_flight_paths['end_lon'][i] ],
            lat = [ df_flight_paths['start_lat'][i], df_flight_paths['end_lat'][i] ],
            mode = 'lines',
            line = dict(
                width = 1,
                color = 'red',
            ),
            opacity = float(df_flight_paths['cnt'][i])/float(df_flight_paths['cnt'].max()),
        )
    )

layout = dict(
        title = 'Feb. 2011 American Airline flight paths<br>(Hover for airport names)',
        showlegend = False,
        height = 800,
        geo = dict(
            scope='north america',
            projection=dict( type='azimuthal equal area' ),
            showland = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )

fig = dict( data=flight_paths + airports, layout=layout )

plotly.offline.iplot(fig)
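
In the mapping cell above, each flight path's opacity is its flight count divided by the largest count, so the busiest route is fully opaque and quieter routes fade out. A vectorised sketch of that normalisation follows, using invented counts in a column named `cnt` as in the flight paths data.

```python
import pandas as pd

# Hypothetical flight counts for three routes
paths = pd.DataFrame({"cnt": [10, 25, 50]})

# Scale counts into (0, 1]: the busiest route gets opacity 1.0
paths["opacity"] = paths["cnt"] / paths["cnt"].max()

print(paths)
```

Computing the column once up front avoids recalculating the maximum on every pass through the loop.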