# Module 4: Working with data and files

So far, we have learned how to do many things with Python:
- Data Types: integers, floats, strings
- Variables: hold one piece of data of a specific type
- Containers: hold many pieces of data
- Statements and Expressions
- Branching
- Looping
- Functions


Next, we will work with real data and introduce packages that make hands-on coding in Python more productive.

In this module, we will introduce [pandas](https://pandas.pydata.org/docs/getting_started/overview.html), a powerful Python package that provides fast, flexible, and expressive data structures for practical, real-world data analysis. We'll learn how to load, clean, and process data, and how to conduct basic descriptive analysis.

pandas is based on [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html), one of the most important packages for numerical computing in Python.


Relevant reading: McKinney's Python for Data Analysis Chapters 4, 5, 6


**Set up a virtual environment**  
Before you start, let's create a conda environment and install all the packages needed for this course, following the steps below:
1. Download `environment.yml` to your working directory (i.e., your course folder).
1. Open your Anaconda Prompt (or a terminal window if using a Mac), change directories to your working folder for this course (where your `environment.yml` is located), and run the following commands, one at a time:  
    `conda config --add channels conda-forge`   
    `conda config --set channel_priority strict`  
    `conda env create --file environment.yml --force`  
1. The installation may take a while to complete. Once it is done, run the following:   
    `conda activate ppua6202`    
    `python -m ipykernel install --sys-prefix --name ppua6202 --display-name "Python (ppua6202)"`   
    `jupyter lab`  
    
You now have a conda environment with all the packages needed for this course, and a Jupyter kernel installed in the environment. Next time you start Jupyter Lab:  
1. In your command prompt/terminal, run `conda activate ppua6202` (the same activation command used during setup).
2. Then run `jupyter lab` to open Jupyter Lab in your browser.

**Download data**  
Data for this week can be downloaded from Canvas.

In [12]:
# import these packages to work with data
import numpy as np
import pandas as pd

A **module** is a Python file (.py) containing Python definitions, statements, or functions.

**Packages** are a way of structuring Python’s module namespace by using “dotted module names”.

**Library** is simply a generic term for a body of code designed to be usable by many applications. It could contain a single package, multiple related packages, a collection of modules, or a single module.

For more on the distinction, see [this discussion on StackOverflow](https://stackoverflow.com/questions/19198166/whats-the-difference-between-a-module-and-a-library-in-python#:~:text=When%20a%20module%2Fpackage%2Fsomething,cannot%20%22run%20a%20library%22.).
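To make the module/package distinction concrete, here is a minimal sketch using only the standard library (`json` and `math` are chosen purely for illustration). It relies on the fact that packages carry a `__path__` attribute while plain modules do not.

```python
import json          # a package: a folder of related modules
import json.decoder  # one module inside that package, reached by its dotted name
import math          # a single module with no sub-modules

# packages have a __path__ attribute; plain modules do not
json_is_package = hasattr(json, '__path__')
math_is_package = hasattr(math, '__path__')

# the dotted module name records where the module lives inside the package
dotted_name = json.decoder.__name__

print(json_is_package, math_is_package, dotted_name)
```

The same pattern applies to the packages used in this module: `np.linalg`, for example, is a sub-package reached through numpy's dotted namespace.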

## 1. Introduce numpy arrays

In [13]:
# a python list is a basic data type
my_list = [1, 2, 3, 4]
my_list

[1, 2, 3, 4]

In [14]:
# a numpy array is like a list, but faster and more compact
my_array = np.array([1, 2, 3, 4])
my_array

array([1, 2, 3, 4])

In [15]:
# you can create a numpy array from an existing list too
my_array = np.array(my_list)
my_array

array([1, 2, 3, 4])

In [16]:
# check the variable type
type(my_array)

numpy.ndarray

In [17]:
# check data type
my_array.dtype

dtype('int64')

numpy has several mathematical functions built-in. Here are some examples.

In [18]:
np.mean(my_array)

2.5

In [19]:
np.std(my_array)

1.118033988749895
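One detail worth knowing about `np.std`: by default it computes the population standard deviation (dividing by n), whereas the pandas `.std()` method computes the sample standard deviation (dividing by n - 1). The sketch below compares the two on the same four values; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4])

population_std = np.std(values)        # numpy default: ddof=0, divides by n
sample_std = np.std(values, ddof=1)    # ddof=1 divides by n - 1
pandas_std = pd.Series(values).std()   # pandas default is ddof=1

print(population_std)   # about 1.118
print(sample_std)       # about 1.291
print(pandas_std)       # matches sample_std
```

If you need numpy and pandas to agree, pass `ddof` explicitly to one of them.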

pandas has two primary data structures we will work with: Series and DataFrames.

## 2. Introduce pandas Series

In [20]:
# a pandas series is based on a numpy array - it's fast, compact, and has more functionality
my_list = [1, 2, 3, 4]
my_series = pd.Series(my_list)
my_series

0    1
1    2
2    3
3    4
dtype: int64

In [21]:
# you can create a new Series by passing in a list variable or array
# series can contain data types other than just integers
series2 = pd.Series(['a', 'b', 'c', 'd'])
series2

0    a
1    b
2    c
3    d
dtype: object

In [22]:
# you can change a series's index (the new index needs one label per element)
series2.index = ['w', 'x', 'y', 'z']
series2

w    a
x    b
y    c
z    d
dtype: object
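A new index must supply exactly one label per element; otherwise pandas raises a `ValueError`. Once the labels are in place, values can be looked up either by label or by position. The following is a small sketch on a fresh Series (the name `letters` is hypothetical):

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd'])
letters.index = ['w', 'x', 'y', 'z']   # four labels for four elements

by_label = letters.loc['y']      # look up by index label
by_position = letters.iloc[2]    # look up by integer position

print(by_label, by_position)     # both refer to the same element
```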

## 3. Introduce pandas DataFrames

In [23]:
# a pandas dataframe is like a table where each column is a series
# a dict can contain multiple lists and label them
my_dict = {'hh_income'  : [75125, 22075, 31950, 115400],
           'home_value' : [525000, 275000, 395000, 985000]}
my_dict

{'hh_income': [75125, 22075, 31950, 115400],
 'home_value': [525000, 275000, 395000, 985000]}

In [24]:
# a pandas dataframe can contain multiple columns/series
# you can create a dataframe by passing in a list, array, series, or dict
df = pd.DataFrame(my_dict)
df

Unnamed: 0,hh_income,home_value
0,75125,525000
1,22075,275000
2,31950,395000
3,115400,985000
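The dict of lists above is only one way in. As a sketch of the other inputs mentioned in the comment, the same two rows can be built from a list of row dicts, a 2-D numpy array, or several Series (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

# from a list of row dicts: the keys become column labels
from_rows = pd.DataFrame([{'hh_income': 75125, 'home_value': 525000},
                          {'hh_income': 22075, 'home_value': 275000}])

# from a 2-D numpy array: column labels must be supplied separately
from_array = pd.DataFrame(np.array([[75125, 525000], [22075, 275000]]),
                          columns=['hh_income', 'home_value'])

# from several Series: each Series becomes one column
from_series = pd.DataFrame({'hh_income': pd.Series([75125, 22075]),
                            'home_value': pd.Series([525000, 275000])})

print(from_rows)
print(from_array.shape, from_series.shape)
```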


In [25]:
# the row labels in the index are accessed by the .index attribute of the DataFrame object
df.index

RangeIndex(start=0, stop=4, step=1)

In [26]:
# the column labels are accessed by the .columns attribute of the DataFrame object
df.columns

Index(['hh_income', 'home_value'], dtype='object')

In [27]:
# the data values are accessed by the .values attribute of the DataFrame object
df.values

array([[ 75125, 525000],
       [ 22075, 275000],
       [ 31950, 395000],
       [115400, 985000]])

In [28]:
# dataframe shape as rows, columns
df.shape

(4, 2)

In [29]:
# datatypes of the columns
df.dtypes

hh_income     int64
home_value    int64
dtype: object

## 4. Working with CSV files

In practice, you'll work with data by loading a dataset file into pandas. CSV is the most common format, but pandas can also ingest tab-separated data, JSON, and proprietary file formats such as Excel .xlsx, Stata, SAS, and SPSS files.

Notice what the pandas `read_csv` function does:

1. recognizes the header row and takes its variable names
1. reads all the rows and constructs a pandas DataFrame, an assembly of pandas Series
1. constructs a unique index, beginning with zero
1. infers the data type of each variable (i.e., column)
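The four steps above can be observed on a tiny in-memory example. This sketch assumes made-up file contents passed through `io.StringIO` in place of a real file on disk, so it runs without any data download.

```python
import io
import pandas as pd

# hypothetical file contents; the empty field after "sep" is a missing value
csv_text = """month,rainfall_inches
jan,5.3
feb,5.4
sep,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(list(sample.columns))   # labels taken from the header row
print(list(sample.index))     # default index starting at zero
print(sample.dtypes)          # data type inferred for each column
print(sample['rainfall_inches'].isnull().sum())   # count of missing values
```

Note that the empty field is read as `NaN`, which is why the rainfall column is inferred as a float type.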

In [30]:
# pandas can load CSV files as DataFrames - it pulls column labels from the first row of the data file
df = pd.read_csv('../data/rain.csv') # path relative to notebook file
df

Unnamed: 0,month,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
6,jul,0.8
7,aug,0.7
8,sep,
9,oct,3.9


In [31]:
# dataframe shape as rows, columns
df.shape

(12, 2)

In [32]:
# datatypes of the columns
df.dtypes

month               object
rainfall_inches    float64
dtype: object

#### We can select subsets of the rows by indexing, and select specific columns by their name:

In [33]:
# a column is a pandas series
type(df['rainfall_inches'])

pandas.core.series.Series

In [34]:
# so is a row
type(df.loc[0])

pandas.core.series.Series

In [35]:
# view a column
df['rainfall_inches']

0     5.3
1     5.4
2     4.8
3     4.7
4     3.3
5     1.2
6     0.8
7     0.7
8     NaN
9     3.9
10    4.5
11    5.9
Name: rainfall_inches, dtype: float64

In [36]:
# sort the values
df.sort_values('rainfall_inches', ascending=False)

Unnamed: 0,month,rainfall_inches
11,dec,5.9
1,feb,5.4
0,jan,5.3
2,mar,4.8
3,apr,4.7
10,nov,4.5
9,oct,3.9
4,may,3.3
5,jun,1.2
6,jul,0.8


In [37]:
# view the "head" of the dataframe
df.head()

Unnamed: 0,month,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3


In [38]:
# or view its "tail"
df.tail()

Unnamed: 0,month,rainfall_inches
7,aug,0.7
8,sep,
9,oct,3.9
10,nov,4.5
11,dec,5.9


In [39]:
# first 6 elements
df['rainfall_inches'][:6]

0    5.3
1    5.4
2    4.8
3    4.7
4    3.3
5    1.2
Name: rainfall_inches, dtype: float64

In [40]:
# final 6 rows
df[6:]

Unnamed: 0,month,rainfall_inches
6,jul,0.8
7,aug,0.7
8,sep,
9,oct,3.9
10,nov,4.5
11,dec,5.9
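Plain square-bracket slicing works, but pandas also offers two explicit indexers: `.iloc` selects by integer position and `.loc` selects by label. A small sketch on made-up data shows the main difference, namely that `.loc` slices include the end label while `.iloc` slices exclude the end position:

```python
import pandas as pd

rain = pd.DataFrame({'month': ['jan', 'feb', 'mar', 'apr'],
                     'rainfall_inches': [5.3, 5.4, 4.8, 4.7]})

by_position = rain.iloc[1:3]   # positions 1 and 2; end position excluded
by_label = rain.loc[1:3]       # labels 1, 2, and 3; end label included
one_cell = rain.loc[2, 'rainfall_inches']   # row label 2, one column

print(len(by_position), len(by_label), one_cell)
```

With the default zero-based index, labels and positions coincide, so the distinction only becomes visible in the slice endpoints.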


In [41]:
# summary descriptive stats
df['rainfall_inches'].describe()

count    11.000000
mean      3.681818
std       1.923444
min       0.700000
25%       2.250000
50%       4.500000
75%       5.050000
max       5.900000
Name: rainfall_inches, dtype: float64

`describe()` silently skipped the missing value for September (note the count of 11 rather than 12) and still returned correct statistics. These methods are built on numpy functions, but pandas lets us work with multiple data types and columns at once.

In [42]:
df['rainfall_inches'].mean()

3.681818181818181

In [43]:
df['rainfall_inches'].std()

1.9234438810727919
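The way pandas skips missing values is controlled by the `skipna` parameter, which defaults to `True`. As a sketch on three made-up values, compare the default with `skipna=False` and with plain numpy on the underlying array:

```python
import numpy as np
import pandas as pd

rainfall = pd.Series([5.3, np.nan, 4.8])

mean_skipping_nan = rainfall.mean()              # NaN ignored: (5.3 + 4.8) / 2
mean_keeping_nan = rainfall.mean(skipna=False)   # NaN propagates to the result
numpy_mean = np.mean(rainfall.to_numpy())        # plain numpy also propagates NaN

print(mean_skipping_nan, mean_keeping_nan, numpy_mean)
```

Skipping is convenient, but it is worth checking `.count()` against `len()` so that missing data does not go unnoticed.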

In [50]:
# now it's your turn
# how would you compute the total rainfall inches between march and august?
df[2:8]['rainfall_inches'].sum()

15.5

#### More DataFrame functionality

In [51]:
# load a new data file
df2 = pd.read_csv('../data/cities.csv')

In [52]:
df2.shape

(8, 2)

In [53]:
# you can view the first few or the last few rows of a DataFrame with the .head() or .tail() methods
df2.head(8)

Unnamed: 0,city,state
0,san francisco,california
1,phoenix,arizona
2,seattle,washington
3,dallas,texas
4,denver,colorado
5,chicago,illinois
6,portland,oregon
7,miami,florida


In [54]:
# you can add a new (empty) column to a DataFrame
df2['country'] = np.nan
df2

Unnamed: 0,city,state,country
0,san francisco,california,
1,phoenix,arizona,
2,seattle,washington,
3,dallas,texas,
4,denver,colorado,
5,chicago,illinois,
6,portland,oregon,
7,miami,florida,


In [55]:
# you can update the values of an entire column in one fell swoop
df2['country'] = 'USA'
df2

Unnamed: 0,city,state,country
0,san francisco,california,USA
1,phoenix,arizona,USA
2,seattle,washington,USA
3,dallas,texas,USA
4,denver,colorado,USA
5,chicago,illinois,USA
6,portland,oregon,USA
7,miami,florida,USA


In [56]:
# you can set the values of a column (aka, Series) in the DataFrame to a list of values
df2['country'] = ['USA', 'United States'] * 4
df2

Unnamed: 0,city,state,country
0,san francisco,california,USA
1,phoenix,arizona,United States
2,seattle,washington,USA
3,dallas,texas,United States
4,denver,colorado,USA
5,chicago,illinois,United States
6,portland,oregon,USA
7,miami,florida,United States


In [57]:
# you can use fast vectorized methods on a pandas series (aka, a column in our dataframe)
df2['country'].str.replace('United States', 'USA')
df2

Unnamed: 0,city,state,country
0,san francisco,california,USA
1,phoenix,arizona,United States
2,seattle,washington,USA
3,dallas,texas,United States
4,denver,colorado,USA
5,chicago,illinois,United States
6,portland,oregon,USA
7,miami,florida,United States


That didn't do anything to our dataframe because `.str.replace()` returns an updated copy; it doesn't perform the operation in place.

In [58]:
# we need to capture the updated values when they get returned
df2['country'] = df2['country'].str.replace('United States', 'USA')
df2

Unnamed: 0,city,state,country
0,san francisco,california,USA
1,phoenix,arizona,USA
2,seattle,washington,USA
3,dallas,texas,USA
4,denver,colorado,USA
5,chicago,illinois,USA
6,portland,oregon,USA
7,miami,florida,USA


In [59]:
# you can change the column names
df2.columns = ['city_name', 'state_name', 'nation']
df2

Unnamed: 0,city_name,state_name,nation
0,san francisco,california,USA
1,phoenix,arizona,USA
2,seattle,washington,USA
3,dallas,texas,USA
4,denver,colorado,USA
5,chicago,illinois,USA
6,portland,oregon,USA
7,miami,florida,USA


In [60]:
# or just rename a single column, passing a dict into the rename() method
df2 = df2.rename(columns={'city_name':'city'})
df2

Unnamed: 0,city,state_name,nation
0,san francisco,california,USA
1,phoenix,arizona,USA
2,seattle,washington,USA
3,dallas,texas,USA
4,denver,colorado,USA
5,chicago,illinois,USA
6,portland,oregon,USA
7,miami,florida,USA


In [61]:
# drop a column; like .str.replace(), drop() returns a new DataFrame and leaves df2 unchanged
df2.drop(columns=['nation'])

Unnamed: 0,city,state_name
0,san francisco,california
1,phoenix,arizona
2,seattle,washington
3,dallas,texas
4,denver,colorado
5,chicago,illinois
6,portland,oregon
7,miami,florida


In [62]:
# you can combine the values of two columns into one, separated by a comma
df2['city_state'] = df2['city'] + ', ' + df2['state_name']
df2.head()

Unnamed: 0,city,state_name,nation,city_state
0,san francisco,california,USA,"san francisco, california"
1,phoenix,arizona,USA,"phoenix, arizona"
2,seattle,washington,USA,"seattle, washington"
3,dallas,texas,USA,"dallas, texas"
4,denver,colorado,USA,"denver, colorado"


In [63]:
# you can also split them again,
# adding two new columns to the existing dataframe
# the split happens at each comma followed by a space
df2[['city1','state1']] = df2['city_state'].str.split(", ", expand=True)
df2.head()

Unnamed: 0,city,state_name,nation,city_state,city1,state1
0,san francisco,california,USA,"san francisco, california",san francisco,california
1,phoenix,arizona,USA,"phoenix, arizona",phoenix,arizona
2,seattle,washington,USA,"seattle, washington",seattle,washington
3,dallas,texas,USA,"dallas, texas",dallas,texas
4,denver,colorado,USA,"denver, colorado",denver,colorado


In [64]:
# you can save your DataFrame as a csv file
df2.to_csv('../data/my_new_dataset.csv')
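By default, `to_csv` also writes the index as an unlabeled first column, which shows up as `Unnamed: 0` when the file is read back in. The sketch below demonstrates this on a small made-up DataFrame, using in-memory text rather than a file on disk; passing `index=False` avoids the extra column.

```python
import io
import pandas as pd

cities = pd.DataFrame({'city': ['seattle', 'dallas'],
                       'state': ['washington', 'texas']})

# with no path argument, to_csv returns the CSV text instead of writing a file
with_index = pd.read_csv(io.StringIO(cities.to_csv()))
without_index = pd.read_csv(io.StringIO(cities.to_csv(index=False)))

print(list(with_index.columns))      # ['Unnamed: 0', 'city', 'state']
print(list(without_index.columns))   # ['city', 'state']
```

Whether to keep the index is a design choice: drop it when it is just the default row counter, and keep it when it carries meaningful labels.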

In [None]:
# now it's your turn
# rename all the columns to new names and save to disk as a new file


### Filtering on values

You can easily filter a dataframe for one or more conditions based on the values in a column. Below we filter df to select only months with less than 3 inches of rainfall.  

In [67]:
df[df['rainfall_inches'] < 3]

Unnamed: 0,month,rainfall_inches
5,jun,1.2
6,jul,0.8
7,aug,0.7


In [68]:
df['rainfall_inches'] < 3

0     False
1     False
2     False
3     False
4     False
5      True
6      True
7      True
8     False
9     False
10    False
11    False
Name: rainfall_inches, dtype: bool

In [69]:
mask = df['rainfall_inches'] < 3
df[mask]

Unnamed: 0,month,rainfall_inches
5,jun,1.2
6,jul,0.8
7,aug,0.7


In [70]:
# flag missing values: True wherever a cell is null
pd.isnull(df)

Unnamed: 0,month,rainfall_inches
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
5,False,False
6,False,False
7,False,False
8,False,True
9,False,False


In [71]:
# the inverse check on a single column: True wherever a value is present
pd.notnull(df['rainfall_inches'])

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True
11     True
Name: rainfall_inches, dtype: bool

You can also select rows based on the values of more than one column. Just remember to nest the individual conditions within parentheses.

In [72]:
df[(df['month'] != 'jul') & (pd.notnull(df['rainfall_inches']))]

Unnamed: 0,month,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
7,aug,0.7
9,oct,3.9
10,nov,4.5
11,dec,5.9


What's that funny ampersand doing there? We'll learn more next week!
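As a brief preview: `&` combines two boolean Series element by element ("and"), `|` does the same for "or", and `~` negates. The parentheses matter because these operators bind more tightly than comparisons such as `<` or `!=`. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

rain = pd.DataFrame({'month': ['jun', 'jul', 'aug', 'sep'],
                     'rainfall_inches': [1.2, 0.8, 0.7, np.nan]})

dry = rain['rainfall_inches'] < 1            # comparisons with NaN are False
missing = rain['rainfall_inches'].isnull()

dry_and_not_jul = rain[dry & (rain['month'] != 'jul')]   # both conditions hold
dry_or_missing = rain[dry | missing]                     # either condition holds
not_missing = rain[~missing]                             # condition negated

print(dry_and_not_jul['month'].tolist())
print(dry_or_missing['month'].tolist())
print(not_missing['month'].tolist())
```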

In [73]:
# now it's your turn
# what is the average rainfall in months with at least 2 inches of rain?
df[df['rainfall_inches'] >= 2]['rainfall_inches'].mean()

4.725

### Editing strings and changing data types

In [74]:
df['rainfall_inches']

0     5.3
1     5.4
2     4.8
3     4.7
4     3.3
5     1.2
6     0.8
7     0.7
8     NaN
9     3.9
10    4.5
11    5.9
Name: rainfall_inches, dtype: float64

In [75]:
# count non-nan cells
df['rainfall_inches'].count()

11

In [76]:
# count all cells, including nans
len(df['rainfall_inches'])

12

In [77]:
# converting to int fails while the column still contains a missing value
df['rainfall_inches'].astype(int)

ValueError: Cannot convert non-finite values (NA or inf) to integer

In [78]:
# fill missing values with a default value (or you could drop them instead!)
df['rainfall_inches'] = df['rainfall_inches'].fillna(0)
df['rainfall_inches']

0     5.3
1     5.4
2     4.8
3     4.7
4     3.3
5     1.2
6     0.8
7     0.7
8     0.0
9     3.9
10    4.5
11    5.9
Name: rainfall_inches, dtype: float64
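Filling with zero and dropping are not equivalent: a filled zero is treated as a real observation in later statistics. The following sketch, on three made-up rows, contrasts `fillna` with `dropna` so you can see the effect on the mean.

```python
import numpy as np
import pandas as pd

rain = pd.DataFrame({'month': ['aug', 'sep', 'oct'],
                     'rainfall_inches': [0.7, np.nan, 3.9]})

filled = rain.fillna({'rainfall_inches': 0})        # keeps all 3 rows
dropped = rain.dropna(subset=['rainfall_inches'])   # keeps the 2 observed rows

filled_mean = filled['rainfall_inches'].mean()      # pulled down by the artificial zero
dropped_mean = dropped['rainfall_inches'].mean()    # based on observed values only

print(len(filled), filled_mean)
print(len(dropped), dropped_mean)
```

Which one is appropriate depends on what a missing value means in your data: "no rain recorded" and "no rain" are different claims.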

In [79]:
# now convert to int
df['rainfall_inches'].astype(int)

0     5
1     5
2     4
3     4
4     3
5     1
6     0
7     0
8     0
9     3
10    4
11    5
Name: rainfall_inches, dtype: int64
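Notice that `astype(int)` truncates: 5.9 became 5 and 0.7 became 0. If rounding to the nearest whole number is what you want, round first and then convert. A short sketch on three of the values:

```python
import pandas as pd

rainfall = pd.Series([5.3, 5.9, 0.7])

truncated = rainfall.astype(int)          # drops the fractional part
rounded = rainfall.round().astype(int)    # rounds to the nearest whole number first

print(truncated.tolist())   # [5, 5, 0]
print(rounded.tolist())     # [5, 6, 1]
```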

In [80]:
# remember to re-assign to capture the output!
df['rainfall_inches'] = df['rainfall_inches'].astype(int)

In [81]:
df[df['month'].str.contains('j')]

Unnamed: 0,month,rainfall_inches
0,jan,5
5,jun,1
6,jul,0


In [82]:
# you can do stats on a filtered subset
df[df['month'].str.contains('j')]['rainfall_inches'].mean()

2.0