# Sp25 PDM Intro to Data
Built and presented by Shivani Sahni and Rahil Shaik

March 19th 2025

### Section 1: Introduction to Jupyter Notebooks

#### 1.1: Introduction to Jupyter Notebooks:
Jupyter Notebooks are interactive documents that combine code, text, and visualizations, making them ideal for data analysis and teaching. They are commonplace in research, machine learning, and quantitative finance for exploratory data work. They let us run different experiments quickly and conveniently to see how we can improve a model's performance.

#### 1.2: Operating a Jupyter Notebook:

- Running Cells: Each notebook consists of cells that can contain code or text. To execute a code cell, click on it and press `Shift + Enter`. You can also hover over a cell and click the button that resembles a play button to run it.

- Creating Cells: You can make two types of cells in Python notebooks: markdown and code. Markdown cells are generally used to add explanatory text around your code cells. Code cells are used for... coding! There are options in the top taskbar to choose between markdown and code. If you double-click into this cell you can see the markdown source behind it!

#### 1.3: Understanding how Kernels work
A kernel is the computational engine that executes the code in the notebook. We will select a Python kernel to execute the cells in this Python notebook. If the kernel stops or "dies", you can restart it from the taskbar above using 'Kernel' > 'Restart'. Restarting clears every variable held in memory, so rerun your cells afterwards.
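
To see which Python installation a kernel is actually using, you can ask Python itself. This is a minimal sketch using only the standard library; the path it prints depends on your machine, so no expected output is shown.

```python
import platform
import sys

# Path of the Python interpreter that this kernel (or script) is running on
print(sys.executable)

# Version of that interpreter as a "major.minor.patch" string
print(platform.python_version())
```

If the path does not point at the environment you expected, the notebook is probably attached to a different kernel.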


### Section 2: Python and Pandas Basics

#### 2.1: Setting up your Python environment
There are a few options here, including installing Python directly on your local system or creating a Python virtual environment (venv, conda). Today we will create a Python venv virtual environment because venvs are generally lightweight. A major advantage is that each environment is isolated, so different projects can use different versions of libraries or of Python itself.

If you are using macOS, you need to install Homebrew, which helps manage packages easily (I think you guys all have macOS). Open your terminal and run the commands below:

`/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`

`echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile`

`eval "$(/opt/homebrew/bin/brew shellenv)"`

To ensure you have installed brew, run this command

`brew --version`

Then install python

`brew install python`

And check if it is installed with

`python3 --version`

`pip3 --version`

Pip is a package manager. If at any point you get a `ModuleNotFoundError`, you can use pip to install the missing package. Pip installs into whichever environment is active, so first create and activate a Python virtual environment for this project using the commands below

`python -m venv pdmdata` or `python3 -m venv pdmdata`

`source pdmdata/bin/activate`

I have listed the package requirements for this project in the 'requirements.txt' file. With the environment active, we can use pip to install them

`pip3 install -r requirements.txt`


Now you're ready to start coding!
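
Before diving in, a quick sanity check can confirm that the virtual environment is active and that the main packages are importable. This is a sketch using only the standard library; the package names checked are the ones imported later in this notebook, and the results depend on your setup.

```python
import importlib.util
import sys

# Inside a venv, sys.prefix points at the environment folder, while
# sys.base_prefix still points at the Python installation it was created from.
in_venv = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_venv)
print("Environment location:", sys.prefix)

# find_spec returns None when a package cannot be found, which is the
# situation that would raise ModuleNotFoundError on import.
status = {}
for package in ("pandas", "numpy", "matplotlib"):
    status[package] = importlib.util.find_spec(package) is not None
    print(package, "installed:", status[package])
```

Any package reported as not installed can be added with `pip3 install` while the environment is active.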


#### 2.2: Basics of Python

First we'll talk about variables, variable types, and how Python interprets and stores data.

In [50]:
# these are a bunch of package imports, the great thing about coding in 2025 is the grunt work is 
# almost always done for you so you can just import packages that do tasks for you

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
import seaborn as sns
import util

In [13]:
# Integer
x = 10  
print(x)

10


In [14]:
# Float
y = 10.5  
print(y)  # prints the value; type(y) is <class 'float'>

10.5


In [15]:
# String
name = "Leponda"
print(name)  # prints the value; type(name) is <class 'str'>

Leponda


In [16]:
# Boolean
is_student = True
print(is_student)  # prints the value; type(is_student) is <class 'bool'>

True
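
The comments in the cells above name each value's type, but `print` only shows the value. A small sketch, reusing the same variables, that asks Python for the type directly with the built-in `type()`:

```python
x = 10
y = 10.5
name = "Leponda"
is_student = True

for value in (x, y, name, is_student):
    # type(value).__name__ is the short name of the type, e.g. "int"
    print(repr(value), "is of type", type(value).__name__)

# Python infers the type from the value; adding an int and a float gives a float
total = x + y
print(total, type(total).__name__)  # 20.5 float
```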


In [24]:
pledges = ["gurnoor", "arjun", "sarah", "katie", "sadie", "aathma", "jay"]
pledge_points = [4, 4, -1, 8, 16, 4, 4] # as of 03/17 at 7:21 PM

print(pledges)  # prints the items; type(pledges) is <class 'list'>
print(pledge_points)  # prints the items; type(pledge_points) is <class 'list'>

['gurnoor', 'arjun', 'sarah', 'katie', 'sadie', 'aathma', 'jay']
[4, 4, -1, 8, 16, 4, 4]


In [25]:
# Dictionary (key-value pairs)
trash_pledge_leaderboard = {"top_pledge": "rahil", "bottom_pledge": "shivani"}
print(trash_pledge_leaderboard)  # prints the pairs; the type is <class 'dict'>

{'top_pledge': 'rahil', 'bottom_pledge': 'shivani'}


This doesn't look right, let's use the lists we created and update the dictionary with the correct pledge and pledge points.

In [26]:
pledge_to_points = zip(pledges, pledge_points)
pledge_leaderboard = dict(pledge_to_points)

print(pledge_leaderboard)

{'gurnoor': 4, 'arjun': 4, 'sarah': -1, 'katie': 8, 'sadie': 16, 'aathma': 4, 'jay': 4}
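
To see what `zip` and `dict` each contribute, it can help to look at the intermediate step. The sketch below uses a shortened copy of the two lists; the bonus added at the end is hypothetical and only there to show an update.

```python
pledges = ["gurnoor", "arjun", "sarah"]
pledge_points = [4, 4, -1]

# zip pairs items by position; list() makes the pairs visible
pairs = list(zip(pledges, pledge_points))
print(pairs)  # [('gurnoor', 4), ('arjun', 4), ('sarah', -1)]

# dict() turns each (key, value) pair into a dictionary entry
leaderboard = dict(pairs)
print(leaderboard["sarah"])  # -1

# Dictionary values can be updated in place (hypothetical 5-point bonus)
leaderboard["sarah"] += 5
print(leaderboard["sarah"])  # 4
```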


Let's use some Python syntax to return the top and bottom pledge. We'll start with a brief overview of for loops and if statements in Python.

In [43]:
for pledge in pledges:
    print(pledge)

gurnoor
arjun
sarah
katie
sadie
aathma
jay


In [44]:
for i in range(len(pledges)):
    print(pledges[i])

gurnoor
arjun
sarah
katie
sadie
aathma
jay
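
Both loops above print the same names: the first walks the items directly, the second walks the positions. When you need the position and the item together, or items from two lists at once, the built-ins `enumerate` and `zip` are a common alternative. A short sketch on a shortened copy of the lists:

```python
pledges = ["gurnoor", "arjun", "sarah"]
pledge_points = [4, 4, -1]

# enumerate yields (position, item) pairs, so no manual indexing is needed
for i, pledge in enumerate(pledges):
    print(i, pledge)

# zip walks two lists in step, pairing items that share a position
for pledge, points in zip(pledges, pledge_points):
    print(pledge, "has", points, "points")
```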


In [47]:
for pledge, points in pledge_leaderboard.items():
    if points > np.mean(pledge_points):
        print("The pledges doing above average are", pledge)
        

The pledges doing above average are katie
The pledges doing above average are sadie


In [48]:
least_points = float('inf')
most_points = float('-inf')

bottom_pledge = ""
top_pledge = ""

for pledge, points in pledge_leaderboard.items():
    if points < least_points:
        least_points = points
        bottom_pledge = pledge
        
    if points > most_points:
        most_points = points
        top_pledge = pledge

In [49]:
print("top pledge is", top_pledge, "with", pledge_leaderboard[top_pledge], "points")
print("bottom pledge is", bottom_pledge, "with", pledge_leaderboard[bottom_pledge], "points")

top pledge is sadie with 16 points
bottom pledge is sarah with -1 points
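
The manual loop shows how the search works step by step. For comparison, here is a shorter sketch using the built-ins `max` and `min` with a `key` function; it should give the same answer on this data.

```python
pledge_leaderboard = {
    "gurnoor": 4, "arjun": 4, "sarah": -1, "katie": 8,
    "sadie": 16, "aathma": 4, "jay": 4,
}

# Iterating a dict yields its keys; key=pledge_leaderboard.get tells
# max/min to compare those keys by their point values.
top_pledge = max(pledge_leaderboard, key=pledge_leaderboard.get)
bottom_pledge = min(pledge_leaderboard, key=pledge_leaderboard.get)

print("top pledge is", top_pledge, "with", pledge_leaderboard[top_pledge], "points")
print("bottom pledge is", bottom_pledge, "with", pledge_leaderboard[bottom_pledge], "points")
```

If several pledges tie, `max` and `min` return the first one they encounter.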


#### 2.3: Using Pandas for Exploratory Data Analysis
We will practice on the California housing data set available through sklearn. Pandas lets us read this information in as a 'dataframe'.

In [None]:
from sklearn.datasets import fetch_california_housing

In [51]:
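california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
# the target (median house value, in units of $100,000) becomes the PRICE column
df["PRICE"] = california.target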

Use `.head()` to get the first 5 rows of your data frame, or pass a number, as in `.head(2)`, to get that many rows

In [58]:
df.head(2)

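|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | PRICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |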


A few operations on the dataframe you can use to extract information

In [59]:
df.head()  # Show first 5 rows
df.tail(3)  # Show last 3 rows
df.info()  # Get information about the DataFrame
df.describe()  # Summary statistics for numerical columns
df.shape  # Get number of rows and columns
df.columns  # List column names

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   PRICE       20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'PRICE'],
      dtype='object')
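
In a notebook, only the value of the last expression in a cell is displayed automatically, which is why the output above shows just the `df.info()` printout and the `df.columns` result. To see several results from one cell, wrap them in `print`. The sketch below does this on a three-row frame named `small` (a hypothetical stand-in built from rows shown earlier), so it runs without downloading the data.

```python
import pandas as pd

# Three rows copied from the output above; the real DataFrame has 20640 rows
small = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, 7.2574],
    "PRICE": [4.526, 3.585, 3.521],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = small.shape
print(n_rows, n_cols)       # 3 2

print(list(small.columns))  # ['MedInc', 'PRICE']
print(small.tail(1))        # last row only
```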

You can select specific columns by name using brackets

In [61]:
df["MedInc"]  # Select a single column (returns a Series)
df[["MedInc", "PRICE"]]  # Select multiple columns

|  | MedInc | PRICE |
| --- | --- | --- |
| 0 | 8.3252 | 4.526 |
| 1 | 8.3014 | 3.585 |
| 2 | 7.2574 | 3.521 |
| 3 | 5.6431 | 3.413 |
| 4 | 3.8462 | 3.422 |
| ... | ... | ... |
| 20635 | 1.5603 | 0.781 |
| 20636 | 2.5568 | 0.771 |
| 20637 | 1.7000 | 0.923 |
| 20638 | 1.8672 | 0.847 |
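
The number of brackets changes what you get back: single brackets with one name return a Series, while double brackets (a list of names) return a DataFrame, even for a single column. A small sketch on a hypothetical three-row frame named `small`:

```python
import pandas as pd

small = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, 7.2574],
    "PRICE": [4.526, 3.585, 3.521],
})

single = small["MedInc"]     # one name -> Series (one-dimensional)
double = small[["MedInc"]]   # list of names -> DataFrame (two-dimensional)

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```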


There are two ways to filter a dataframe down to the rows that meet a condition in pandas: `.query()` and bracket notation with boolean conditions

In [56]:
df.query("MedInc > 5.6431 and PRICE > 3")

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | PRICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 118 | 5.8596 | 50.0 | 6.742627 | 1.069705 | 970.0 | 2.600536 | 37.84 | -122.23 | 3.276 |
| 120 | 5.9560 | 41.0 | 6.851064 | 1.079787 | 794.0 | 2.111702 | 37.83 | -122.24 | 3.661 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20487 | 7.9013 | 5.0 | 8.319293 | 1.120924 | 2440.0 | 3.315217 | 34.30 | -118.67 | 3.930 |
| 20494 | 6.3236 | 24.0 | 7.829396 | 1.065617 | 1203.0 | 3.157480 | 34.29 | -118.71 | 3.020 |
| 20501 | 6.5483 | 20.0 | 7.588517 | 0.894737 | 699.0 | 3.344498 | 34.30 | -118.71 | 3.350 |
| 20503 | 8.2787 | 27.0 | 6.935065 | 1.103896 | 243.0 | 3.155844 | 34.33 | -118.75 | 3.300 |


In [57]:
df[(df["MedInc"] > 5.6431) & (df["PRICE"] > 3)]

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | PRICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 118 | 5.8596 | 50.0 | 6.742627 | 1.069705 | 970.0 | 2.600536 | 37.84 | -122.23 | 3.276 |
| 120 | 5.9560 | 41.0 | 6.851064 | 1.079787 | 794.0 | 2.111702 | 37.83 | -122.24 | 3.661 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20487 | 7.9013 | 5.0 | 8.319293 | 1.120924 | 2440.0 | 3.315217 | 34.30 | -118.67 | 3.930 |
| 20494 | 6.3236 | 24.0 | 7.829396 | 1.065617 | 1203.0 | 3.157480 | 34.29 | -118.71 | 3.020 |
| 20501 | 6.5483 | 20.0 | 7.588517 | 0.894737 | 699.0 | 3.344498 | 34.30 | -118.71 | 3.350 |
| 20503 | 8.2787 | 27.0 | 6.935065 | 1.103896 | 243.0 | 3.155844 | 34.33 | -118.75 | 3.300 |


In [63]:
df[(df["MedInc"] > 5.6431) | (df["PRICE"] > 3)]

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | PRICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20506 | 6.9454 | 8.0 | 6.873103 | 1.040000 | 2510.0 | 3.462069 | 34.29 | -118.73 | 2.765 |
| 20527 | 1.4653 | 7.0 | 3.525794 | 1.017857 | 4479.0 | 8.886905 | 38.54 | -121.79 | 3.100 |
| 20531 | 5.9629 | 17.0 | 6.867133 | 1.097902 | 808.0 | 2.825175 | 38.58 | -121.81 | 2.860 |
| 20533 | 4.2432 | 13.0 | 6.350569 | 1.053775 | 2553.0 | 2.640124 | 34.54 | -121.67 | 3.265 |
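
The three cells above can be compared side by side on a tiny frame. The sketch below uses four rows taken from the outputs in this section, stored in a hypothetical frame named `small`. It checks that `.query()` with `and` selects the same rows as bracket notation with `&`, and shows that `|` (or) keeps more rows. Each condition needs its own parentheses in bracket notation, because `&` and `|` bind more tightly than `>`.

```python
import pandas as pd

small = pd.DataFrame({
    "MedInc": [8.3252, 5.6431, 3.8462, 1.5603],
    "PRICE": [4.526, 3.413, 3.422, 0.781],
})

# Both conditions must hold (and / &)
with_query = small.query("MedInc > 5.6431 and PRICE > 3")
with_mask = small[(small["MedInc"] > 5.6431) & (small["PRICE"] > 3)]
print(with_query.equals(with_mask))  # True

# At least one condition must hold (|)
either = small[(small["MedInc"] > 5.6431) | (small["PRICE"] > 3)]
print(len(with_mask), len(either))   # 1 3
```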


We can also get specific rows and columns using `.iloc[]` (by position) and `.loc[]` (by label)

In [None]:
df.iloc[0]  # Select first row (by position)
df.iloc[:3]  # Select first three rows

df.loc[0, "MedInc"]  # Select a specific value (row label 0, column "MedInc")
df.loc[:, "PRICE"]  # Select all rows for the "PRICE" column

0        4.526
1        3.585
2        3.521
3        3.413
4        3.422
         ...  
20635    0.781
20636    0.771
20637    0.923
20638    0.847
20639    0.894
Name: PRICE, Length: 20640, dtype: float64
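
With the default index (0, 1, 2, ...), positions and labels coincide, so `.iloc[]` and `.loc[]` can look interchangeable. The sketch below, on a hypothetical four-row frame named `small`, shows two places where they differ: how slices treat the end point, and what happens once the labels are no longer 0, 1, 2, .... The labels 10 to 40 are made up for illustration.

```python
import pandas as pd

small = pd.DataFrame({
    "MedInc": [8.3252, 8.3014, 7.2574, 5.6431],
    "PRICE": [4.526, 3.585, 3.521, 3.413],
})

# Position-based slices exclude the end, like Python lists
print(len(small.iloc[:2]))  # 2 rows: positions 0 and 1

# Label-based slices include the end label
print(len(small.loc[:2]))   # 3 rows: labels 0, 1 and 2

# With non-default labels (hypothetical), position and label diverge
relabeled = small.copy()
relabeled.index = [10, 20, 30, 40]
print(relabeled.iloc[0]["PRICE"])  # 4.526, the first row by position
print(relabeled.loc[10, "PRICE"])  # 4.526, the row whose label is 10
# relabeled.loc[0] would raise a KeyError because no row is labelled 0
```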