# Module 4: Working with data and files

So far, we’ve learned how to do many things with Python:
- Data Types: integers, floats, strings
- Variables: hold one piece of data of a specific type
- Containers: hold many pieces of data
- Statements and Expressions
- Branching
- Looping
- Functions


Next, we will learn to work hands-on with data and introduce useful packages for practicing coding in Python.  

In this module, we will introduce [pandas](https://pandas.pydata.org/docs/getting_started/overview.html), a powerful Python package that provides fast, flexible, and expressive data structures that are useful for doing practical, real-world data analysis in Python. We'll learn how to load, clean, and process data, and how to conduct basic descriptive analysis.

pandas is based on [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html), one of the most important packages for numerical computing in Python.


Relevant reading: McKinney's *Python for Data Analysis*, Chapters 4, 5, and 6


**Set up the virtual environment**  
Before you start, let's create a conda environment and install all the necessary packages for this course, following the steps below:
1. Download `environment.yml` to your working directory (i.e., your course folder)
1. Open your Anaconda prompt (or a terminal window if using a Mac), change directories to your working folder for this course (where your `environment.yml` is located), and run the following commands, one at a time:  
    `conda config --add channels conda-forge`   
    `conda config --set channel_priority strict`  
    `conda env create --file environment.yml --force`  
1. The installation may take a while to complete. Once it is done, run the following:  
    `conda activate ppua6202`    
    `python -m ipykernel install --sys-prefix --name ppua6202 --display-name "Python (ppua6202)"`   
    `jupyter lab`  
    
You now have a conda environment with all the packages needed for this course, and a Jupyter kernel installed in the environment. Next time you start Jupyter Lab:  
1. In your command prompt/terminal, run `conda activate ppua6202` (the same command used during setup; it works on both Windows and Mac).
2. Then run `jupyter lab` to open Jupyter Lab in your browser.

**Download data**  
Data for this week can be downloaded from Canvas.

In [None]:
# import these packages to work with data
import numpy as np
import pandas as pd

A **module** is a Python file (.py) containing Python definitions, statements, or functions.

**Packages** are a way of structuring Python’s module namespace by using “dotted module names”.

**Library** is simply a generic term for a bunch of code that was designed with the aim of being usable by many applications. It could contain a package or multiple related packages, or a collection of modules or a single module.

For more detail, see [this post on Stack Overflow](https://stackoverflow.com/questions/19198166/whats-the-difference-between-a-module-and-a-library-in-python#:~:text=When%20a%20module%2Fpackage%2Fsomething,cannot%20%22run%20a%20library%22.).
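
To make the distinction concrete, here is a small sketch using only the standard library. It assumes the usual CPython layout, where `math` is a single module and `json` is a package (a folder of modules); the `is_package` name is just for illustration.

```python
import math            # a single module
import json            # a package: a folder that contains several modules
import json.decoder    # a module inside the package, reached with a dotted name

# Packages carry a __path__ attribute (the folder they live in); plain modules do not.
is_package = {
    "math": hasattr(math, "__path__"),
    "json": hasattr(json, "__path__"),
    "json.decoder": hasattr(json.decoder, "__path__"),
}
print(is_package)
```

Both `numpy` and `pandas` are packages in this sense, which is why you will see dotted names such as `np.random` later on.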

## 1. Introduce numpy arrays

In [None]:
# a python list is a basic data type
my_list = [1, 2, 3, 4]
my_list

In [None]:
# a numpy array is like a list, but faster and more compact
my_array = np.array([1, 2, 3, 4])
my_array

In [None]:
# you can create a numpy array from an existing list too
my_array = np.array(my_list)
my_array

In [None]:
# check the variable type
type(my_array)

In [None]:
# check data type
my_array.dtype
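
One practical difference between a list and an array is how arithmetic behaves. The sketch below is not part of the original notebook; the variable names are made up for illustration.

```python
import numpy as np

my_list = [1, 2, 3, 4]
my_array = np.array(my_list)

# multiplying a list repeats it; multiplying an array multiplies every element
doubled_list = my_list * 2
doubled_array = my_array * 2

# adding two arrays adds them element by element
summed = my_array + my_array

# an array holds a single data type, so mixing ints and floats gives floats
mixed = np.array([1, 2.5])

print(doubled_list)
print(doubled_array)
print(mixed.dtype)
```

This element-by-element behavior is what people mean when they call numpy operations "vectorized".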

numpy has several mathematical functions built-in. Here are some examples.

In [None]:
np.mean(my_array)

In [None]:
np.std(my_array)

pandas has two primary data structures we will work with: Series and DataFrames.

## 2. Introduce pandas Series

In [None]:
# a pandas series is based on a numpy array - it's fast, compact, and has more functionality
my_list = [1, 2, 3, 4]
my_series = pd.Series(my_list)
my_series

In [None]:
# you can create a new Series by passing in a list variable or array
# series can contain data types other than just integers
series2 = pd.Series(['a', 'b', 'c', 'd'])
series2

In [None]:
# you can change a series's index (the new index needs one label per element)
series2.index = ['w', 'x', 'y', 'z']
series2
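
Once a Series has custom labels, you can look values up either by label or by position. This is a minimal sketch with a made-up `letters` Series, so it runs on its own.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd'], index=['w', 'x', 'y', 'z'])

by_label = letters.loc['x']         # look up by index label
by_position = letters.iloc[1]       # look up by integer position

# label slices include both endpoints; position slices exclude the end
label_slice = letters.loc['w':'y']
position_slice = letters.iloc[0:2]

print(by_label, by_position)
print(label_slice.tolist())
print(position_slice.tolist())
```

Using `.loc` for labels and `.iloc` for positions keeps it clear which kind of lookup you mean.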

## 3. Introduce pandas DataFrames

In [None]:
# a pandas dataframe is like a table where each column is a series
# a dict can contain multiple lists and label them
my_dict = {'hh_income'  : [75125, 22075, 31950, 115400],
           'home_value' : [525000, 275000, 395000, 985000]}
my_dict

In [None]:
# a pandas dataframe can contain multiple columns/series
# you can create a dataframe by passing in a list, array, series, or dict
df = pd.DataFrame(my_dict)
df

In [None]:
# the row labels in the index are accessed by the .index attribute of the DataFrame object
df.index

In [None]:
# the column labels are accessed by the .columns attribute of the DataFrame object
df.columns

In [None]:
# the data values are accessed by the .values attribute of the DataFrame object
df.values

In [None]:
# dataframe shape as rows, columns
df.shape

In [None]:
# datatypes of the columns
df.dtypes
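
A dict of lists is only one way to build a DataFrame. As a rough sketch, here are three other inputs that produce the same small table (the variable names are just for illustration, and only the first two rows of the example data are used).

```python
import numpy as np
import pandas as pd

# a list of dicts: each dict becomes one row
rows = [{'hh_income': 75125, 'home_value': 525000},
        {'hh_income': 22075, 'home_value': 275000}]
df_from_rows = pd.DataFrame(rows)

# a 2D numpy array: we supply the column labels ourselves
arr = np.array([[75125, 525000], [22075, 275000]])
df_from_array = pd.DataFrame(arr, columns=['hh_income', 'home_value'])

# a dict of Series: each Series becomes one column
df_from_series = pd.DataFrame({'hh_income': pd.Series([75125, 22075]),
                               'home_value': pd.Series([525000, 275000])})

print(df_from_rows)
```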

## 4. Working with CSV files

In practice, you'll work with data by loading a dataset file into pandas. CSV is the most common format. But pandas can also ingest tab-separated data, JSON, and proprietary file formats like Excel .xlsx files, Stata, SAS, and SPSS.

Notice what pandas's `read_csv` function does. It:

1. recognizes the header row and gets its variable names
1. reads all the rows and constructs a pandas DataFrame, an assembly of pandas Series
1. constructs a unique index, beginning with zero
1. infers the data type of each variable (i.e., column)
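
The steps above can be seen on a tiny CSV held in memory. The data here is invented for illustration (it is not the course's rain dataset), and `io.StringIO` simply lets `read_csv` treat a string as if it were a file.

```python
import io
import pandas as pd

# a made-up CSV: one header row, four data rows, one missing value
csv_text = """month,rainfall_inches
jan,5.3
feb,5.4
mar,
apr,2.1
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())   # column names taken from the header row
print(sample.index.tolist())     # a fresh index that starts at zero
print(sample.dtypes)             # one inferred data type per column
print(sample)                    # the empty field shows up as NaN
```

Note that the numeric column is read as floats, and the empty field for `mar` becomes a missing value rather than an error.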

In [None]:
# pandas can load CSV files as DataFrames - it pulls column labels from the first row of the data file
df = pd.read_csv('../data/rain.csv') # path relative to notebook file
df

In [None]:
# dataframe shape as rows, columns
df.shape

In [None]:
# datatypes of the columns
df.dtypes

#### We can select subsets of the rows by indexing, and select specific columns by their name:

In [None]:
# a column is a pandas series
type(df['rainfall_inches'])

In [None]:
# so is a row
type(df.loc[0])

In [None]:
# view a column
df['rainfall_inches']

In [None]:
# sort the values
df.sort_values('rainfall_inches', ascending=False)

In [None]:
# view the "head" of the dataframe
df.head()

In [None]:
# or view its "tail"
df.tail()

In [None]:
# first 6 elements
df['rainfall_inches'][:6]

In [None]:
# final 6 rows (a negative start counts back from the end)
df[-6:]
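
If you want to be explicit about whether you are selecting by position or by label, pandas offers `.iloc` and `.loc`. The sketch below uses a small made-up `rain` table rather than the course data.

```python
import pandas as pd

rain = pd.DataFrame({'month': ['jan', 'feb', 'mar', 'apr'],
                     'rainfall_inches': [5.3, 5.4, 4.8, 2.1]})

first_two = rain.iloc[:2]                    # first two rows, by position
last_two = rain.iloc[-2:]                    # last two rows, by position
one_cell = rain.loc[2, 'rainfall_inches']    # row label 2, one named column
labelled = rain.loc[1:2, 'month']            # label slices include both ends

print(first_two)
print(last_two)
print(one_cell)
print(labelled.tolist())
```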

In [None]:
# summary descriptive stats
df['rainfall_inches'].describe()

pandas silently skips the missing value for September and still gives correct statistical results. These methods are built on NumPy, but in pandas we can now deal with multiple data types and columns.

In [None]:
df['rainfall_inches'].mean()

In [None]:
df['rainfall_inches'].std()
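
Two defaults differ between numpy and pandas and are worth knowing about: how missing values are treated, and which standard deviation is computed. Here is a small sketch with made-up numbers.

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, np.nan, 4.0]
arr = np.array(values)
ser = pd.Series(values)

np_mean = np.mean(arr)        # nan: numpy does not skip missing values by default
nan_mean = np.nanmean(arr)    # numpy's NaN-aware version
pd_mean = ser.mean()          # pandas skips NaN by default

pd_std = ser.std()                        # sample standard deviation (ddof=1)
np_std = np.nanstd(arr)                   # population standard deviation (ddof=0)
np_std_sample = np.nanstd(arr, ddof=1)    # matches the pandas default

print(np_mean, nan_mean, pd_mean)
print(pd_std, np_std, np_std_sample)
```

So if a pandas result and a numpy result disagree slightly, check the `ddof` setting and the handling of NaN first.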

In [None]:
# now it's your turn
# how would you compute the total rainfall inches between march and august?


#### More DataFrame functionality

In [None]:
# load a new data file
df2 = pd.read_csv('../data/cities.csv')

In [None]:
df2.shape

In [None]:
# you can view the first few or the last few rows of a DataFrame with the .head() or .tail() methods
df2.head(8)

In [None]:
# you can add a new (empty) column to a DataFrame
df2['country'] = np.nan
df2

In [None]:
# you can update the values of an entire column in one fell swoop
df2['country'] = 'USA'
df2

In [None]:
# you can set the values of a column (aka, Series) in the DataFrame to a list of values
# (the list length must match the number of rows)
df2['country'] = ['USA', 'United States'] * 4
df2

In [None]:
# you can use fast vectorized methods on a pandas series (aka, a column in our dataframe)
df2['country'].str.replace('United States', 'USA')
df2

That didn't do anything to our dataframe, because `.str.replace()` returns the updated version. It doesn't perform the operation in place.

In [None]:
# we need to capture the updated values when they get returned
df2['country'] = df2['country'].str.replace('United States', 'USA')
df2

In [None]:
# you can change the column names
df2.columns = ['city_name', 'state_name', 'nation']
df2

In [None]:
# or just rename a single column, passing a dict into the rename() method
df2 = df2.rename(columns={'city_name':'city'})
df2

In [None]:
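# you can drop a column; drop() returns a new DataFrame and leaves df2 unchanged unless you re-assign
df2.drop(columns=['nation'])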

In [None]:
# you could combine the values into one column, separated by a comma
df2['city_state'] = df2['city'] + ', ' + df2['state_name']
df2.head()

In [None]:
# you could also split them
# this adds two new columns to the existing dataframe
# the split is done on the comma followed by a space
df2[['city1','state1']] = df2['city_state'].str.split(", ", expand=True)
df2.head()

In [None]:
# you can save your DataFrame as a csv file
df2.to_csv('../data/my_new_dataset.csv')
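
By default `to_csv` also writes the index as the first column, which may or may not be what you want. The sketch below uses a made-up `cities` table and writes to a string instead of a file (calling `to_csv` without a path returns the CSV text).

```python
import io
import pandas as pd

cities = pd.DataFrame({'city': ['Boston', 'Austin'], 'state': ['MA', 'TX']})

with_index = cities.to_csv()                  # index written as an unnamed first column
without_index = cities.to_csv(index=False)    # only the data columns

# read both versions back to see the difference
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

If the index is just the default row numbers, passing `index=False` usually gives a cleaner file.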

In [None]:
# now it's your turn
# rename all the columns to new names and save to disk as a new file


### Filtering on values

You can easily filter a dataframe for one or more conditions based on the values in a column. Below we filter df to select only months with less than 3 inches of rainfall.  

In [None]:
df[df['rainfall_inches'] < 3]

In [None]:
df['rainfall_inches'] < 3

In [None]:
mask = df['rainfall_inches'] < 3
df[mask]

In [None]:
pd.isnull(df)

In [None]:
pd.notnull(df['rainfall_inches'])

You can also select rows based on the values of more than one column. Just remember to wrap each individual condition in parentheses.

In [None]:
df[(df['month'] != 'jul') & (pd.notnull(df['rainfall_inches']))]

What's that funny ampersand doing there? We'll learn more next week!
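
As a brief preview, here is a sketch of how conditions can be combined, using a small made-up `rain` table. `&` means "and", `|` means "or", and `~` means "not", each applied row by row. The parentheses matter because `&` and `|` bind more tightly than comparisons like `!=` and `>`.

```python
import numpy as np
import pandas as pd

rain = pd.DataFrame({'month': ['jun', 'jul', 'aug', 'sep'],
                     'rainfall_inches': [3.5, 1.2, 2.8, np.nan]})

# naming each condition first keeps the filter readable
not_july = rain['month'] != 'jul'
has_value = rain['rainfall_inches'].notnull()

both = rain[not_july & has_value]                                         # and
either = rain[(rain['month'] == 'jul') | (rain['rainfall_inches'] > 3)]   # or
negated = rain[~has_value]                                                # not

print(both['month'].tolist())
print(either['month'].tolist())
print(negated['month'].tolist())
```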

In [None]:
# now it's your turn
# what is the average rainfall in months with at least 2 inches of rain?


### Editing strings and changing data types

In [None]:
df['rainfall_inches']

In [None]:
# count non-nan cells
df['rainfall_inches'].count()

In [None]:
# count all cells, including nans
len(df['rainfall_inches'])

In [None]:
# this raises an error: the column still contains a NaN, which cannot be converted to int
df['rainfall_inches'].astype(int)

In [None]:
# fill missing values with a default value (or you could drop them instead!)
df['rainfall_inches'] = df['rainfall_inches'].fillna(0)
df['rainfall_inches']

In [None]:
# now convert to int
df['rainfall_inches'].astype(int)

In [None]:
# remember to re-assign to capture the output!
df['rainfall_inches'] = df['rainfall_inches'].astype(int)
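
The sequence above can be summarized in one self-contained sketch. The numbers are made up, and which option is right (filling or dropping) depends on what a missing value means in your data. Note also that `astype(int)` truncates decimals rather than rounding.

```python
import numpy as np
import pandas as pd

rainfall = pd.Series([1.2, np.nan, 3.9])

# converting directly fails because NaN has no integer equivalent
try:
    rainfall.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# option 1: fill the missing value first, then convert
filled = rainfall.fillna(0).astype(int)

# option 2: drop the missing value first, then convert
dropped = rainfall.dropna().astype(int)

print(conversion_failed)
print(filled.tolist())
print(dropped.tolist())
```

Filling with 0 changes statistics such as the mean, so dropping is often the safer choice when a value is simply unknown.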

In [None]:
# keep only the rows whose month contains the letter 'j'
df[df['month'].str.contains('j')]

In [None]:
# you can do stats on a filtered subset
df[df['month'].str.contains('j')]['rainfall_inches'].mean()
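
An equivalent way to write this kind of filter is to pass the row condition and the column name to `.loc` together. This is a sketch on a made-up `rain` table; it also uses `.str.startswith`, which is stricter than `.str.contains` because it only matches at the beginning of the string.

```python
import pandas as pd

rain = pd.DataFrame({'month': ['jan', 'feb', 'jun', 'jul'],
                     'rainfall_inches': [5, 4, 3, 1]})

# select rows and a column in one step with .loc[rows, column]
starts_with_j = rain['month'].str.startswith('j')
j_mean = rain.loc[starts_with_j, 'rainfall_inches'].mean()

# the two-step version from above gives the same answer here
chained = rain[rain['month'].str.contains('j')]['rainfall_inches'].mean()

print(j_mean, chained)
```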