# Lab 1 - Python, Pandas and Matplotlib
- **Author:** Emily Aiken ([emilyaiken@berkeley.edu](mailto:emilyaiken@berkeley.edu)) (Adapted from labs by Qutub Khan Vajihi and Dimitris Papadimitriou)
- **Date:** January 26, 2022
- **Course:** INFO 251: Applied Machine Learning

### Learning Objectives:

* Know what is good style when writing Python code
* Learn some useful Python features that you may not already know about
* Work with DataFrames using the Pandas library
* Produce basic graphs using the Matplotlib library, and learn some tips to produce readable and beautiful graphs

### Feedback:

After the lab, please provide feedback via this anonymous [Google Form](https://forms.gle/rHvftuoLpnEHXSNX9). It should take about 20 seconds!


## 1. Python Code Style
Below are some key points for Python coding style. Most importantly, remember that code is for people to read (and, in this class, for people to grade), so use your best judgement to make your code readable. For more information, visit Guido van Rossum's Python style guide: [PEP 8 -- Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/).

*Agenda: Line length, variable names, strings, whitespace, blank lines, comments, imports*

* **Line length**:
    PEP 8 sets the maximum line length at 79 characters. As a rule of thumb, in Jupyter notebooks, just don't go over the length of the box on a laptop screen. If you have a very long line of code, you can break it using a backslash or, as PEP 8 prefers, by wrapping the expression in parentheses.
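
    As a minimal sketch of the two ways to break a long line (the variable names and numbers here are made up for illustration):

```python
# Option 1: a backslash at the end of the line continues the statement
total_cost = 10 + 20 + \
    30 + 40

# Option 2: wrap the expression in parentheses (the style PEP 8 prefers)
total_cost_parenthesized = (10 + 20
                            + 30 + 40)

print(total_cost, total_cost_parenthesized)
```

    Both statements compute the same value; the parenthesized form avoids the risk of a stray space after the backslash breaking the line continuation.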
    




* **Variable names:** Make variable names (nouns) and function names (verbs) descriptive.

In [None]:
# Correct
hyperparameter_grid = {1, 2, 3}
number_of_iterations = 20

# Incorrect
a = 12
var = 10

* **Strings:** Be consistent between ' and ".

* **Whitespace:** 
Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -=, etc.), comparisons (==, <, >, !=, <=, >=, in, not in, is, is not), and Booleans (and, or, not). Avoid extraneous whitespace immediately inside parentheses, brackets, or braces.

In [None]:
# Correct:
i = 0
i = i + 1
i += 1
lst = [0, 1, 2]
tple = (0, 1, 2)
st = {0, 1, 2}
print(lst[0])

# Incorrect:
i=0
i+=1
lst = [ 0, 1, 2 ]
tple = ( 0, 1, 2 )
st = { 0, 1, 2 }
print( lst[ 0 ] )

* **Blank lines:**
Leave two blank lines around top-level function and class definitions, including between the import block and the first definition.

In [None]:
import math  # Needed for math.sqrt below


def foo(x):
    if x >= 0:
        return math.sqrt(x)
    else:
        return None


def bar(x):
    if x < 0:
        return None
    return math.sqrt(x)

* **Comments:**
    For readability, try to always explain the functionality of your lines by commenting. Comments can come before blocks of code, or inline for single lines of code.

In [None]:
# Create a dictionary and invert it
my_map = {'AML': 0, 'Lab': 1}
inv_map = {v: k for k, v in my_map.items()}  # Swap keys and values

# Print the dictionary, then display it as the cell's output
print(inv_map)
inv_map

* **Imports** - Import different modules/packages on separate lines. Several names imported from the same module can share one line.

In [None]:
# Correct:
import pandas
import matplotlib

# Wrong:
import pandas, matplotlib

# Correct:
from sklearn.metrics import r2_score, roc_auc_score

## 2. Some Useful Python Features
*Agenda: Reading/writing files, file paths, enumerate, lambda functions, zip*

* **Reading and writing files:** Use "with" to open files, which ensures they are closed automatically, even if an error occurs inside the block.

In [None]:
# Use "with" to open files...
with open('test.txt', 'r') as f:
    for line in f:
        print(line)

In [None]:
# ...otherwise you explicitly need to 'open' and 'close' files.
f = open('test.txt', 'r')
for line in f:
    print(line)

f.close()
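
The cells above only read a file and assume that `test.txt` already exists. As a minimal sketch of the writing side, the example below creates a file inside a temporary directory (the file name and contents are made up for illustration), writes two lines, and reads them back:

```python
import os
import tempfile

# Work in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.txt')

    # Mode 'w' creates the file (or overwrites it if it already exists)
    with open(path, 'w') as f:
        f.write('first line\n')
        f.write('second line\n')

    # Read the lines back, stripping the trailing newline from each
    with open(path, 'r') as f:
        lines = [line.rstrip('\n') for line in f]

print(lines)
```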

* **File paths:** Concatenate path parts with **os.path.join** rather than with string concatenation

In [None]:
import os

# Correct
country_name = 'USA'
month = 'January'
path = os.path.join('a', 'b', country_name, month)
print(path)

# Less correct
path = 'a/b/' + country_name + '/' + month
print(path)

* **Enumerate**: great for getting the index and the element of an iterable at the same time.

In [None]:
# Use enumerate to get the index (which comes first) and the element (which comes second) at the same time...
for i, x in enumerate([1, 2, 3]):
    print('Index:', i)
    print('Element:', x)

In [None]:
# ...otherwise you'll have to maintain a counter variable by hand, which isn't very elegant
counter = 0
for x in [1, 2, 3]:
    print('Index:', counter)
    print('Element:', x)
    counter += 1
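
`enumerate` also accepts an optional `start` argument, which is handy when numbering should begin at 1. A small sketch, using a made-up list of column names:

```python
column_names = ['mpg', 'weight', 'year']

# start=1 makes the count begin at 1 instead of the default 0
numbered = [(i, name) for i, name in enumerate(column_names, start=1)]
print(numbered)
```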

* **Lambda functions**: A lambda function is a small, anonymous function, meaning it is defined without a name. Lambda functions are used a lot with pandas.

In [None]:
# Lambda function with one variable
x = lambda a: a*3 + 3
print(x(3))  # prints 12

# The equivalent named function
def my_function(a):
    return a*3 + 3
print(my_function(3))  # also prints 12

In [None]:
# Lambda function with two variables
x = lambda a, b: a * b
print(x(5, 6))  # prints 30
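
Lambda functions are most useful when passed directly as an argument to another function. As a sketch, the example below sorts a made-up list of (name, mpg) tuples by the second element, using a lambda as the `key` argument of the built-in `sorted`:

```python
# Hypothetical data: (car name, miles per gallon)
cars = [('chevrolet', 18.0), ('toyota', 24.0), ('ford', 15.0)]

# The lambda tells sorted() to compare the tuples by their mpg value
cars_by_mpg = sorted(cars, key=lambda car: car[1])
print(cars_by_mpg)
```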

* **Zipping**: The zip() function returns a zip object, which is an iterator of tuples: the first items of the passed iterables are paired together, then the second items, and so on. If the passed iterables have different lengths, the one with the fewest items decides the length of the result. (https://www.w3schools.com/python/ref_func_zip.asp). It's a great way of pairing together two lists.

In [None]:
# Example of zipping two lists of the same length
products = ['table', 'chair', 'sofa']
prices = [50, 20, 200]

for product, price in zip(products, prices):
    print('Product: {}, Price: {}'.format(product, price))

In [None]:
# Example of zipping two lists of different lengths -- the result stops at the end of the shorter list
products = ['table', 'chair', 'sofa', 'bed']
prices = [50, 20, 200]

for product, price in zip(products, prices):
    print('Product: {}, Price: {}'.format(product, price))
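
A common companion to the loops above is building a dictionary from two paired lists. A short sketch, reusing the same product and price lists:

```python
products = ['table', 'chair', 'sofa']
prices = [50, 20, 200]

# dict() consumes the (product, price) tuples produced by zip
price_by_product = dict(zip(products, prices))
print(price_by_product)
```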

## 3. Pandas
*Agenda: Data loading, viewing, selection, grouping/aggregation, mutation, string operations*

#### 3.1 Load the data

In [None]:
import pandas as pd

# Loading a csv
auto_df = pd.read_csv('Auto.csv')

# Other loading tricks
auto_df1 = pd.read_csv('Auto.csv', nrows=10) # Useful if dataset has too many observations to fit in memory
auto_df2 = pd.read_csv('Auto.csv', usecols=['mpg', 'cylinders']) # Useful if you only need a few columns

The dataset contains the following columns:

- mpg: Miles per gallon
- cylinders: Number of cylinders between 4 and 8
- displacement: Engine displacement (cu. inches)
- horsepower: Engine horsepower
- weight: Vehicle weight (lbs.)
- acceleration: Time to accelerate from 0 to 60 mph (sec.)
- year: Model year (modulo 100)
- origin: Origin of car (1. American, 2. European, 3. Japanese)
- name: Vehicle name

#### 3.2 Viewing the data

In [None]:
# Dimensions of the dataframe
auto_df.shape

In [None]:
# Display first few rows
display(auto_df.head())

In [None]:
# Display last few rows
auto_df.tail()

In [None]:
# Quick descriptive stats
auto_df.describe(include='all')

In [None]:
# Dataset types
auto_df.dtypes

#### 3.3 Selection

There are multiple ways to select data from a pandas dataframe. Here are a few options...

In [None]:
# Select columns using the double bracket notation
auto_df[['mpg','cylinders']] 

In [None]:
# Select multiple rows by slicing as though the dataframe were a list -- note that this notation uses
# row positions and ignores the dataframe's index
auto_df[0:3]

In [None]:
# Select a row or rows using "loc", which selects by index label (label slices include both endpoints)
auto_df.loc[0]
#auto_df.loc[0:2]
#auto_df.loc[0:2, ['year','name']]

In [None]:
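# Select a row or rows using "iloc", which selects by position and ignores the index
auto_df_adjusted_index = auto_df.copy()
auto_df_adjusted_index.index = auto_df_adjusted_index.index + 3  # Index labels now start at 3
auto_df_adjusted_index.iloc[0]  # Still the first row, even though its label is 3
#auto_df_adjusted_index.iloc[0:2]
#auto_df_adjusted_index.iloc[0:2, 3:5] # Note that iloc also uses numbers to denote the columns selected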
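
To make the difference between "loc" and "iloc" concrete, here is a minimal sketch on a small made-up dataframe (not the Auto data) whose index labels start at 3, so labels and positions disagree:

```python
import pandas as pd

# Hypothetical toy data with index labels 3, 4, 5
toy_df = pd.DataFrame(
    {'mpg': [18.0, 15.0, 16.0], 'year': [70, 70, 71]},
    index=[3, 4, 5],
)

first_by_label = toy_df.loc[3]      # Row whose index label is 3
first_by_position = toy_df.iloc[0]  # Row at position 0 (the same row here)

label_slice = toy_df.loc[3:4]       # Label slices include both endpoints
position_slice = toy_df.iloc[0:1]   # Position slices exclude the end

print(len(label_slice), len(position_slice))
```

Note that `toy_df.loc[0]` would raise a `KeyError` here, because no row carries the label 0.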


In [None]:
# Filtering by column values
auto_df[(auto_df['mpg'] < 18.0) & (auto_df['year'] == 70)]

In [None]:
# Selecting unique values from a single column
auto_df['cylinders'].unique()

In [None]:
# Selecting unique values from a set of columns
auto_df[['cylinders', 'year']].drop_duplicates()

#### 3.4 Aggregation and Grouping

In [None]:
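# Basic aggregations over the entire dataframe: You can just apply the functions directly.
# numeric_only=True skips text columns such as 'name'; pandas 2.0 and later raise an error without it.
auto_df.mean(numeric_only=True)
#auto_df.std(numeric_only=True)
#auto_df.min()
#auto_df.max()
#auto_df.median(numeric_only=True)
#auto_df.mode()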

In [None]:
# Grouping: Creates a "pandas groupby" object
auto_df.groupby('origin')

In [None]:
# Grouped aggregations: Apply aggregations to the "pandas groupby" object
auto_df.groupby('origin').mean(numeric_only=True)
auto_df.groupby('origin').agg('mean', numeric_only=True) # Equivalent syntax

In [None]:
# Tip: Use as_index=False in groupby to keep the groups as a regular column
auto_df.groupby('origin', as_index=False).mean(numeric_only=True)
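
When only some columns or several statistics are needed, `agg` also accepts named aggregations. A minimal sketch on made-up data (the output column names `mean_mpg` and `max_mpg` are chosen here for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Auto dataset
toy_df = pd.DataFrame({
    'origin': [1, 1, 2, 3],
    'mpg': [18.0, 16.0, 26.0, 30.0],
})

# Each keyword is an output column: (input column, aggregation function)
summary = toy_df.groupby('origin', as_index=False).agg(
    mean_mpg=('mpg', 'mean'),
    max_mpg=('mpg', 'max'),
)
print(summary)
```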

#### 3.5 Mutations

In [None]:
# Basic functions of one or more columns: Use the intuitive syntax
auto_df['weight_increment'] = auto_df['weight'] + 1
auto_df['acceleration_times_mpg'] = auto_df['acceleration']*auto_df['mpg']

In [None]:
# More complex functions: Use "apply"
auto_df['mpg_string'] = auto_df['mpg'].apply(lambda x: str(x) + ' MPG') # Apply with a single column

# Apply using multiple columns -- much slower, and don't forget the "axis=1"
auto_df['name_year'] = auto_df.apply(lambda row: row['name'] + ': ' + str(row['year']), axis=1)

auto_df.head()
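
Since row-wise "apply" is slow, it is worth checking whether the same result can be built from whole-column operations. A sketch on made-up data, assuming the goal is the same "name: year" string as above:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Auto dataset
toy_df = pd.DataFrame({'name': ['ford torino', 'honda civic'], 'year': [70, 74]})

# Row-wise apply, as in the cell above
name_year_apply = toy_df.apply(lambda row: row['name'] + ': ' + str(row['year']), axis=1)

# Column-wise alternative: convert 'year' to strings, then concatenate whole columns
name_year_vectorized = toy_df['name'] + ': ' + toy_df['year'].astype(str)

print(name_year_vectorized.tolist())
```

Both approaches give the same strings; the column-wise version avoids calling a Python function once per row.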

In [None]:
# Map is occasionally an elegant alternative to apply
cust_map = {8: 0, 4: 1} # A map is just a dictionary mapping an input value to an output value
auto_df['cylinders_new'] = auto_df['cylinders'].map(cust_map) # Values with no entry in the map become NaN
auto_df

#### 3.6 String Operations

In [None]:
# Make upper case
auto_df['name'].str.upper()

In [None]:
# Make lower case
auto_df['name'].str.lower()

In [None]:
# Split string into words
auto_df['name'].str.split(' ')

In [None]:
# Join list of strings together
auto_df['name'].str.split(' ').str.join('-')

## 4. Matplotlib

In [None]:
# Enable inline plotting of matplotlib figures, and import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

#### 4.1 Boxplots

In [None]:
plt.figure(figsize=(10, 7))
plt.boxplot(auto_df['mpg'])
plt.show()

#### 4.2 Histograms

In [None]:
plt.figure()
plt.hist(auto_df['cylinders'], color='Red')
plt.show()

#### 4.3 Scatter Plots

In [None]:
plt.figure()
plt.scatter(auto_df['mpg'], auto_df['weight'], alpha=0.2) # Low alpha makes overlapping markers readable
plt.show()

#### 4.4 Bar Plots

In [None]:
plt.figure()
plt.barh(auto_df['year'], auto_df['weight'])
plt.show()

#### 4.5 Making plots readable: Title, axes labeling, legends, subplots and more

In [None]:
# Title, axis labels, and legend
plt.figure()
plt.scatter(auto_df['mpg'], auto_df['weight'], alpha=0.2, label='Cars (N=%i)' % len(auto_df)) # Label for legend

plt.title('Car weight as a function of gas mileage', fontsize='x-large')
#plt.xlabel('Miles per gallon')
#plt.ylabel('Weight (pounds)')
#plt.legend(loc='best')

# The right and top spines are ugly -- let's remove them
#plt.gca().spines['top'].set_visible(False)
#plt.gca().spines['right'].set_visible(False)

plt.show()

In [None]:
# Subplots
fig, ax = plt.subplots(2, 2, figsize=(10, 6))
ax = ax.flatten() # Turns the axes object into a 1D array instead of a 2D array -- convenient for indexing

ax[0].boxplot(auto_df['mpg'])
ax[1].hist(auto_df['cylinders'], color='Red')
ax[2].scatter(auto_df['mpg'], auto_df['weight'], alpha=0.2)
ax[3].barh(auto_df['year'], auto_df['weight'])

# Note that the syntax for the title is slightly different for subplots. Syntax is likewise a little different
# for setting axis labels.
ax[0].set_title('Distribution of gas mileage')
ax[1].set_title('Distribution of number of cylinders')
ax[2].set_title('Gas mileage vs. car weight')
ax[3].set_title('Weight by year')

# Again, turning off the top and right spines
for a in ax:
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)

plt.tight_layout() # Use tight_layout so titles and labels of neighboring subplots do not overlap
plt.show()
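
The subplot cell above mentions that axis labels use a slightly different syntax on subplots. As a minimal sketch with made-up numbers: on an axes object the calls are `set_xlabel` and `set_ylabel` rather than `plt.xlabel` and `plt.ylabel`. The `matplotlib.use('Agg')` line is only there so the sketch runs as a plain script; leave it out in a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # Assumption: non-interactive backend for running outside a notebook
import matplotlib.pyplot as plt

# Hypothetical values standing in for auto_df['mpg'] and auto_df['weight']
mpg = [18.0, 15.0, 26.0, 30.0]
weight = [3504, 3693, 2130, 1985]

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].scatter(mpg, weight, alpha=0.5, label='Cars (N=%i)' % len(mpg))
ax[0].set_title('Gas mileage vs. car weight')
ax[0].set_xlabel('Miles per gallon')  # plt.xlabel becomes set_xlabel
ax[0].set_ylabel('Weight (pounds)')   # plt.ylabel becomes set_ylabel
ax[0].legend(loc='best')

ax[1].hist(mpg)
ax[1].set_title('Distribution of gas mileage')
ax[1].set_xlabel('Miles per gallon')

plt.tight_layout()
```

In a notebook, finish the cell with `plt.show()` as in the cells above.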

#### 4.6 Making plots beautiful: Seaborn

In [None]:
import seaborn as sns
sns.set(font_scale=1.5) # Convenient way to set the font scale for all parts of the plot at the same time

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].scatter(auto_df['mpg'], auto_df['weight'], alpha=0.2)
ax[1].hist(auto_df['cylinders'], color='Red')

plt.show()

**Bonus**: Seaborn also has some beautiful built-in plots. If there is time, try experimenting with any of the following plots from seaborn using the auto_df data: [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html), [violinplot](https://seaborn.pydata.org/generated/seaborn.violinplot.html), or [kernel density estimate](https://seaborn.pydata.org/generated/seaborn.kdeplot.html). 

## 5. Bonus: Some Pandas Exercise Questions

#### (Adapted from Introduction to Statistical Learning, James et al. (2013))


Using the 'Auto.csv' dataset that we utilized earlier, try to answer the questions below - 

a) Are there missing values? Show at least two ways to check for null values in a dataframe.

b) Which predictors are quantitative and which are qualitative?

Write your answer below - 

c) What is the *range* of **mpg** and **cylinders**?

d) What is the mean and standard deviation of **weight** and **acceleration**?

e) Now remove the 10th through 85th observations, and for the remaining data report the min, max, mean, and standard deviation of **mpg**.

f) What is max weight per year?