# Day 2, Session 1: Dictionaries and functions, NumPy and Pandas

This session concludes our tour of Python language essentials with another key data structure (dictionaries) and with writing reusable code blocks (functions). We also introduce the key libraries for handling tabular data: NumPy (short for Numerical Python) and Pandas.

In [None]:
import pandas as pd
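# pyplot is the plotting interface; %matplotlib inline shows charts inside the notebook
import matplotlib.pyplot as plt
%matplotlib inline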

### 1. Dictionaries
A dictionary contains data in key:value pairs (much like a standard dictionary!)

In [None]:
# dictionaries are defined using curly brackets
{'hello':'bonjour',
'child':'enfant'}

In [None]:
# they contain key-value pairs separated using colons and commas:

student1 = {'name':'Jason',
            'gender':'male',
           'major':'Physics',
           'exam_result':3.8}

In [None]:
# querying a dictionary with one of its keys returns the corresponding value

student1['name']

In [None]:
# dictionaries are mutable: assigning to an existing key overwrites its value (keys must be unique)

student1['exam_result'] = 3.9

In [None]:
# list all the keys
student1.keys()

In [None]:
# ...and the values:
student1.values()

In [None]:
# check the length of a dictionary

enrollment_roster = {'English_class' : 4,
                    'Spanish_class' : 3,
                    'Python_class' : 65}

len(enrollment_roster)

In [None]:
# iterate through it

for k,v in enrollment_roster.items():
    print(k)
    print("students enrolled", v)
    print()
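
A few other everyday dictionary operations follow from the points above: testing for a key, looking up a key that may be missing, and adding or removing pairs. The sketch below redefines `student1` so that it runs on its own; the `'minor'` and `'year'` keys are hypothetical and used only for illustration.

```python
# same dictionary as above, redefined so this cell runs on its own
student1 = {'name': 'Jason',
            'gender': 'male',
            'major': 'Physics',
            'exam_result': 3.9}

# test whether a key exists before using it
has_major = 'major' in student1

# .get() returns a default value instead of raising KeyError for a missing key
minor = student1.get('minor', 'not declared')

# assigning to a new key adds a key-value pair ('year' is a made-up field)
student1['year'] = 2

# del removes a key-value pair
del student1['gender']

print(has_major, minor)
print(student1)
```

Querying a missing key with square brackets (`student1['minor']`) raises a `KeyError`, which is why `.get()` is often the safer choice when a key may be absent.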

### 2. Functions
So you have written some code for a difficult task (e.g. solving Fermat's Last Theorem). You may want to do the same task again. You could (a) memorize the code and re-write it each time; or (b) write a function. A function is a reusable code block. You can pass data into functions (as parameters), and functions can return data to the main program.

In [None]:
# define a function

def my_function():
    print("Hi I'm a function")

In [None]:
# once defined, call it once or many times

my_function()

The `def` statement introduces a function definition. It expects a function name, parentheses, any parameters the function will take, and a colon. The function code block must be indented. The parentheses are always required, when defining or calling a function, even if no parameters are used.

In [None]:
# define a function with parameters:
def sound_more_excited(my_string):
    new_string = my_string.upper() + "!!"
    print(new_string)

In [None]:
greeting = "Hi people, this is Python session 3"
greeting

In [None]:
sound_more_excited(greeting)

Note: each function has its own internal namespace (or symbol table). The data passed into `sound_more_excited` is referred to, within the function, as `my_string`; names created inside the function, such as `new_string`, are not visible outside it.
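
The examples so far print their result rather than handing it back to the caller. A minimal sketch of a function that returns data is shown below; `make_more_excited` is a hypothetical name chosen for this illustration.

```python
def make_more_excited(my_string):
    """Return an upper-case copy of my_string with '!!' appended."""
    return my_string.upper() + "!!"

# the returned value can be stored in a variable and reused later
result = make_more_excited("hello")
print(result)
```

A function without a `return` statement returns `None`, so storing the output of a print-only function in a variable gives you nothing to work with afterwards.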

### 3. Built-in functions and importing libraries
Python has many built-in functions: commands that are part of the basic language.

Check the [documentation](https://docs.python.org/3/library/functions.html) and use whichever are useful to you, e.g. `len()`, `max()`, `print()`.

Use `help()` or an internet search to check what parameters they take and how they behave.

In [None]:
# check the length of a string
len("eucalyptus")

In [None]:
# check the length of a list
len([1,3,6,'cactus'])

In [None]:
# get help on a built-in function
max?

In [None]:
# get the largest item in a list
mylist2 = [5,4,3,3827]

max(mylist2)

In [None]:
# sum the items in a list
sum(mylist2)

**Imports**. When built-in functions don't suffice, you can `import` one of the many available libraries. 

Popular examples include `datetime` (handle dates and times), `random` (generate random numbers), `os` (handle your computer's file system), and `math` (mathematics operations). These are each part of the Python Standard Library - documentation [here](https://docs.python.org/3/tutorial/stdlib.html). 

In addition, you can install many other libraries on your computer. The most popular include `numpy` and `pandas` (data manipulation), `matplotlib` and `seaborn` (plotting), and `scikit-learn` (machine learning).

In [None]:
import datetime

In [None]:
# a basic import statement
import random

In [None]:
# generate a random integer between 1 and 6
random.randint(1,6)

In [None]:
# import part of a library
from random import randint
from datetime import datetime

In [None]:
randint(1,6)

In [None]:
# use auto-complete to check which functions are available:
# type `datetime.` and press Tab (e.g. to find datetime.now())
datetime.now()

In [None]:
# pandas is conventionally abbreviated to pd, numpy to np
import numpy as np
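
As a short illustration of the import styles above, the sketch below uses two Standard Library modules. The date is arbitrary and chosen only for the example.

```python
import math
from datetime import datetime

# call a function through the module name
root = math.sqrt(16)

# build a specific date and time, then read its attributes
session_start = datetime(2019, 7, 16, 9, 30)   # arbitrary example date
print(root)
print(session_start.year, session_start.isoformat())
```

With `import math` you write `math.sqrt(...)`; with `from datetime import datetime` the name `datetime` refers directly to the class, so no module prefix is needed.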

### 4. NumPy

NumPy, which stands for Numerical Python, is a fundamental package for high-performance scientific computing and data analysis.

The NumPy array (`ndarray`) is a highly efficient way of storing and manipulating numerical data.

In [None]:
import numpy as np
from IPython.display import Image


In [None]:
Image(r'https://raw.githubusercontent.com/worldbank/Python-for-Data-Science/master/July_2019_Poverty_GP/day_1/images/numpy_array_indexing.png')

#### Create NumPy Arrays:

In [None]:
# create a 1-dimensional array from a list
my_list=[1,2,3,4]

np.array(my_list)

In [None]:
# create a 2-dimensional array from a list of lists
array_2d = np.array([[1,2,3,4], [5,6,7,8]])
array_2d

#### Look up info on the array:

In [None]:
my_array=array_2d

In [None]:
# shape of the array
my_array.shape

In [None]:
# data type
my_array.dtype

In [None]:
# astype returns a new array with the requested data type (my_array itself is unchanged)
my_array.astype(float).dtype

#### Slicing and indexing

Like other Python sequences, NumPy arrays can be sliced. Because arrays may be multidimensional, you can give one index or slice per dimension, separated by commas; any dimension you leave out is taken in full.

In [None]:
# Let's create a NumPy array comprising the integers 0 through 9
arr = np.arange(10)
arr

In [None]:
# You can get the item at index 5 
arr[5]

In [None]:
# slice the array to get the items from index 5 up to, but not including, index 8
arr[5:8]

In [None]:
# Indexing with big arrays looks complex ... but it's powerful

Image(r'https://raw.githubusercontent.com/worldbank/Python-for-Data-Science/master/July_2019_Poverty_GP/day_1/images/2dslice.png')
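
To make the one-slice-per-dimension idea concrete, here is a small sketch that reuses the 2 x 4 array defined earlier (redefined so the cell runs on its own). The first position in the square brackets selects rows, the second selects columns.

```python
import numpy as np

array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# single element: row index 1, column index 2
item = array_2d[1, 2]

# a whole row, and a whole column
first_row = array_2d[0, :]
second_col = array_2d[:, 1]

# a sub-block: all rows, columns 1 and 2 (the stop index 3 is excluded)
block = array_2d[:, 1:3]

print(item)
print(first_row)
print(second_col)
print(block)
```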

### 5. Pandas DataFrames
Pandas, a library written by Wes McKinney, is a great tool for data manipulation and analysis. It provides two main classes:
* a Series object, which handles a single column of data;
* a DataFrame object, which handles multiple columns (like an Excel spreadsheet).

You can build your own DataFrames or read them in from other sources such as CSV or JSON files. Pandas handles missing data gracefully; lets you sort, operate on and merge datasets; provides plotting capabilities; and handles time series data (among other advantages).
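
As a minimal sketch of reading a CSV and handling missing data, the example below parses a small CSV held in memory, so no file on disk is assumed. The town names and figures are made up for illustration.

```python
import io
import pandas as pd

# a small CSV held in memory; the second row has no population value
csv_text = """name,population,bakeries
Townville,1000,4
Villageton,,2
"""

demo_df = pd.read_csv(io.StringIO(csv_text))

# the empty population field is read as NaN (missing)
n_missing = demo_df['population'].isna().sum()

# aggregations such as .mean() skip missing values by default
mean_population = demo_df['population'].mean()

print(demo_df)
print(n_missing, mean_population)
```

To read a real file you would pass its path to `pd.read_csv` in place of the `io.StringIO` object.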

#### (a) Creating Series and DataFrames

In [None]:
# import the library
import pandas as pd

In [None]:
# create a series by passing a list

towns = pd.Series(['Cardiff', 'Swansea', 'Abergavenny','Machynlleth'])
towns

In [None]:
# lists are a great building block for dataframes
towns = ['Cardiff', 'Swansea', 'Abergavenny','Machynlleth']
populations = [335145, 230300, 12515, 2235]
bakeries = [43, 25, 12, 4]

In [None]:
# create an empty dataframe
towns_df = pd.DataFrame()
towns_df

In [None]:
# use square bracket notation to add a column
towns_df['name'] = towns
towns_df

In [None]:
# add another column

towns_df['population'] = populations
towns_df

In [None]:
# the columns have different data types (dtypes)

towns_df.dtypes

In [None]:
# quick way to create dataframes: pass a dictionary 
# each key-value pair specifies (i) the column's name; (ii) the corresponding data

towns_df = pd.DataFrame({'name': towns,
                         'population': populations,
                         'bakeries': bakeries})

towns_df

#### (b) View and select data

In [None]:
# the .head() method shows the top rows

towns_df.head(2)

In [None]:
# check how many rows and columns
towns_df.shape

In [None]:
# Inspect only one series using square bracket notation

towns_df['population']

In [None]:
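# Or dot notation (only works when the column name is a valid Python name
# that does not clash with an existing DataFrame attribute or method)

towns_df.population

In [None]:
# Standard Python slicing works on a Series

towns_df.bakeries[:3]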
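
Beyond single columns, you will often want several columns or a specific block of rows and columns. The sketch below shows one way to do this with a list of column names, `.iloc` (selection by integer position) and `.loc` (selection by label). The DataFrame is rebuilt here so the cell runs on its own.

```python
import pandas as pd

towns_df = pd.DataFrame({'name': ['Cardiff', 'Swansea', 'Abergavenny', 'Machynlleth'],
                         'population': [335145, 230300, 12515, 2235],
                         'bakeries': [43, 25, 12, 4]})

# several columns at once: pass a list of column names
subset = towns_df[['name', 'bakeries']]

# .iloc selects by integer position: first two rows, first two columns
top_left = towns_df.iloc[:2, :2]

# .loc selects by label: row label 2, column 'name'
town = towns_df.loc[2, 'name']

print(subset)
print(top_left)
print(town)
```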

#### (c) Operate on columns

In [None]:
# Use Boolean indexing to inspect values based on a condition

towns_df[towns_df.name == 'Abergavenny']

In [None]:
# Create a new column with math outputs

towns_df['bakeries_per_capita'] = towns_df.bakeries / towns_df.population
towns_df['people_per_bakery'] = towns_df.population / towns_df.bakeries

towns_df

In [None]:
# Use a condition on a single column to select rows
# (with this data, a threshold of 1500 returns the two smallest towns)

towns_df.loc[towns_df.people_per_bakery < 1500]

In [None]:
# Use the .sort_values() method

towns_df.sort_values(by = 'people_per_bakery')
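
Conditions can also be combined, and sorting can run from largest to smallest. The following is a sketch under the same data as above (rebuilt so the cell runs on its own); the thresholds are arbitrary.

```python
import pandas as pd

towns_df = pd.DataFrame({'name': ['Cardiff', 'Swansea', 'Abergavenny', 'Machynlleth'],
                         'population': [335145, 230300, 12515, 2235],
                         'bakeries': [43, 25, 12, 4]})
towns_df['people_per_bakery'] = towns_df['population'] / towns_df['bakeries']

# combine conditions with & (and) or | (or); each condition needs its own parentheses
small_and_well_served = towns_df[(towns_df['population'] < 100000) &
                                 (towns_df['people_per_bakery'] < 1000)]

# sort from largest to smallest
ranked = towns_df.sort_values(by='people_per_bakery', ascending=False)

print(small_and_well_served)
print(ranked)
```

Note that `.sort_values()` returns a new, sorted DataFrame and leaves `towns_df` itself in its original order.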

#### (d) Plot outputs

In [None]:
# Plot charts using the .plot() method 

towns_df.plot(x = 'name', y = 'bakeries_per_capita', kind = 'bar', title = 'Some great towns to visit')