# General Assembly DAT11 Sydney - 26th Feb 2018

## Reading data from some simple sources

This notebook contains exercises for getting started with data analysis in Python. The two main topics we will cover in this class are:
1. Reading in data from different sources
2. Manipulating data in Python

### Reading in data from different sources
1. Reading in from a URL
2. Reading in from an Excel spreadsheet
3. Reading in from a CSV file

In [1]:
# Load the Iris dataset from CSV URL
# 1. Import the required libraries
import numpy as np
import pandas as pd
import urllib.request  # the request submodule must be imported explicitly for urlopen
import xlrd

In [2]:
# 2. Specify the URL for the Iris dataset (UCI Machine Learning Repository)
url = "http://goo.gl/HppjFh"

# 3. Download the file
raw_data = urllib.request.urlopen(url)

# 4. Load the CSV file as a pandas DataFrame, supplying column names because the file has no header row
#dataset = pd.read_csv(raw_data, delimiter=",")
dataset = pd.read_csv(raw_data, delimiter=",", names=('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'))
#print(dataset.shape)
dataset.head()

# Refer to http://pandas.pydata.org/pandas-docs/version/0.15.0/io.html#io-read-csv-table

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
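
The `names` argument matters here because the downloaded file has no header row: its first line is already a measurement. The sketch below shows the difference offline, using two rows copied from the output above in place of the download. The variable names `sample`, `no_names` and `with_names` are illustrative only.

```python
import io

import pandas as pd

# Two rows copied from the output above; the notebook fetches the full file from the URL.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Without names=, pandas uses the first data row as the header and loses one observation.
no_names = pd.read_csv(io.StringIO(sample))

# With names=, every row is kept as data and the columns get readable labels.
with_names = pd.read_csv(io.StringIO(sample), names=columns)

print(no_names.shape)
print(with_names.shape)
```

Comparing the two shapes is a quick way to check whether a file that arrives without documentation has a header row.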


In [10]:
# Read data from an Excel spreadsheet
# 1. Load the file into Python
xl = pd.ExcelFile("../../../../data/iris.xlsx")
# 2. Find what sheets are in the workbook
xl.sheet_names

['iris']

In [11]:
# 3. Read in the dataset from the 'iris' sheet
df = xl.parse("iris")
df.head()
df.shape

(150, 5)

In [12]:
# Bonus: To write the DataFrame to Excel format we can use the 'to_excel' method
df.to_excel('iris_saved_v2.xlsx', sheet_name='Sheet1')

In [15]:
# Read data from a CSV file
iris_data = pd.read_csv('../../../../data/iris.csv')
iris_data.head()

Unnamed: 0.1,Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,0,5.1,3.5,1.4,0.2,setosa
1,1,4.9,3.0,1.4,0.2,setosa
2,2,4.7,3.2,1.3,0.2,setosa
3,3,4.6,3.1,1.5,0.2,setosa
4,4,5.0,3.6,1.4,0.2,setosa


In [16]:
# To write the DataFrame to CSV format we can use the 'to_csv' method
df.to_csv('iris_saved2.csv')
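
By default `to_csv` also writes the row index as an unnamed first column. This is a likely explanation for the `Unnamed: 0` column seen when `iris.csv` was read in above, although the notebook does not say how that file was produced. A minimal sketch of the effect, using a hypothetical two-row frame and an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for df.
small = pd.DataFrame({'petal_length': [1.4, 1.3], 'species': ['setosa', 'setosa']})

# Default behaviour: the index is written as an unnamed first column.
buffer = io.StringIO()
small.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False leaves the index out, so the round trip preserves the original columns.
buffer = io.StringIO()
small.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the index carries no information, passing `index=False` when saving avoids accumulating an extra column on every save and reload.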

### Manipulating Data in Python
In this section we will begin to summarise the data to get an idea of its distribution and of the cleaning it requires. This is an essential step in any data science project.

In [17]:
# Get a count of the number of rows in the DataFrame
len(iris_data.index)

150

In [None]:
# Get the dimensions of the DataFrame
iris_data.shape

In [None]:
# Summarise the data
iris_data.describe()
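
As a rough illustration of how a summary can hint at cleaning needs, the sketch below uses a small hypothetical frame with one missing value (the frame `measurements` is invented for this example and is not part of the Iris data). `describe()` covers numeric columns only by default and its `count` row excludes missing values, while `isna().sum()` counts them per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing measurement.
measurements = pd.DataFrame({
    'petal_length': [1.4, np.nan, 4.7],
    'species': ['setosa', 'setosa', 'versicolor'],
})

# Summary statistics for the numeric columns; 'count' ignores missing values.
summary = measurements.describe()

# Number of missing values in each column.
missing = measurements.isna().sum()

print(summary)
print(missing)
```

A `count` lower than the number of rows is a sign that a column has gaps to investigate.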

In [None]:
# Select only the observations with petal_length < 1.7
iris_data[(iris_data['petal_length']<1.7)]
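
The selection above works in two steps: the comparison produces a boolean Series (one `True` or `False` per row), and indexing the DataFrame with that Series keeps only the `True` rows. A minimal sketch on a hypothetical four-row frame, with `flowers`, `mask`, `short` and `short_setosa` as illustrative names:

```python
import pandas as pd

# Hypothetical four-row frame standing in for iris_data.
flowers = pd.DataFrame({
    'petal_length': [1.4, 4.7, 1.6, 5.1],
    'species': ['setosa', 'versicolor', 'setosa', 'virginica'],
})

# Step 1: the comparison yields one boolean per row.
mask = flowers['petal_length'] < 1.7

# Step 2: indexing with the mask keeps only the rows where it is True.
short = flowers[mask]

# Conditions are combined with & (and) or | (or); each one needs its own parentheses.
short_setosa = flowers[(flowers['petal_length'] < 1.7) & (flowers['species'] == 'setosa')]

print(mask.tolist())
print(short)
```

The parentheses around each condition are required because `&` binds more tightly than the comparison operators in Python.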

In [None]:
# Now let's group the data by the species
byspecies = iris_data.groupby('species')
byspecies.describe()

In [None]:
# Apply a function by a group (Species)
# You can try mean, max, median, etc
byspecies['petal_length'].max()

In [None]:
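# We can also aggregate by group (summing petal lengths makes little sense here, but the pattern will come in handy)
# Passing the function name as a string avoids the deprecation warning that newer pandas versions raise for np.sum
byspecies['petal_length'].aggregate('sum')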

In [None]:
# We can also apply several aggregations at once; each one becomes a column in the result
# 'size' counts the rows in each group, as len did
byspecies['petal_length'].agg(['size', 'mean', 'std'])
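
When several statistics are needed, pandas also supports named aggregation, where each output column is given as `name=(column, function)`. The sketch below assumes pandas 0.25 or later and uses a hypothetical four-row frame; the output column names `n`, `longest` and `average` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical four-row frame standing in for iris_data.
flowers = pd.DataFrame({
    'petal_length': [1.4, 1.6, 4.7, 4.5],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})

# Each keyword names an output column and maps it to (input column, aggregation).
stats = flowers.groupby('species').agg(
    n=('petal_length', 'size'),
    longest=('petal_length', 'max'),
    average=('petal_length', 'mean'),
)

print(stats)
```

Named aggregation keeps the result columns self-describing, which helps when the summary table is passed on to later steps.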