<img src= "https://static.wixstatic.com/media/9278e7_c8e6664df6e44185b1da6e60e9e8da6c~mv2.png/v1/fill/w_110,h_110,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/black_logo.png">

# Sp25 PDM Intro to Data
Built and presented by Shivani Sahni and Rahil Shaik

March 19th 2025

### Section 1: Introduction to Jupyter Notebooks

#### 1.1: Introduction to Jupyter Notebooks:
Jupyter Notebooks are interactive documents that combine code, text, and visualizations, making them ideal for data analysis and teaching. They are commonplace in research, machine learning, and quantitative finance for exploratory data work, because they let us run different experiments and see how we can improve a model's performance in a streamlined and convenient manner.

#### 1.2: Operating a Jupyter Notebook:

- Running Cells: Each notebook consists of cells that can contain code or text. To execute a code cell, click on it and press `Shift + Enter`. You can also hover over a cell and click the play button that appears to run it.

- Creating Cells: You can make two types of cells in Python notebooks: Markdown and code. Markdown cells are generally used to add explanatory text around your code cells. Code cells are used for... coding! The top taskbar has options to choose between Markdown and code. If you double-click into this cell, you can see the Markdown source behind it!

#### 1.3: Understanding how Kernels work
A kernel is the computational engine that executes the code in the notebook. We will select a Python kernel to execute the cells in this Python notebook. If the kernel stops or "dies", you can restart it from the taskbar above using 'Kernel' > 'Restart'.
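
To confirm which Python interpreter the selected kernel is actually running, a quick check like the following can help. This is a minimal sketch using only the standard library; the path and version it prints depend on your machine.

```python
import sys

# Path of the Python interpreter the kernel is running on
print(sys.executable)

# Major and minor version numbers of that interpreter
print(sys.version_info.major, sys.version_info.minor)
```

If the path is not the one you expected, you have probably selected a different kernel than you intended.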


### Section 2: Python and Pandas Basics

#### 2.1: Setting up your Python environment
There are a few options here, including installing Python directly on your local system or creating a Python virtual environment (venv, conda). Today we will create a venv virtual environment because venvs are generally lightweight and, as a major advantage, let you create isolated environments that use different versions of libraries or of Python itself.

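If you are using macOS, you need to install Homebrew, which helps manage packages easily (I think you guys all have macOS). Open your terminal and run the commands below. The second and third commands assume an Apple Silicon Mac, where Homebrew installs to `/opt/homebrew`; on an Intel Mac it installs to `/usr/local` instead, so adjust the path: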

`/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`

`echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile`

`eval "$(/opt/homebrew/bin/brew shellenv)"`

To confirm that brew installed correctly, run this command:

`brew --version`

Then install Python and Git:

`brew install python`

`brew install git`

Then you can clone the PDM repository using git:

`git clone https://github.com/rahilisashaik/pdm-intro-to-data.git`

You should then open this directory in Visual Studio Code, Google Colab, or wherever you would like. I assume it was cloned into your home directory:

`/Users/rahilshaik/pdm-intro-to-data` or `~/pdm-intro-to-data`

For the rest of these instructions, you should be in the built-in terminal of your coding environment (Colab, VS Code, Jupyter)

Check if python installed correctly with

`python3 --version`

`pip --version` or `pip3 --version`

Now we can create a python virtual environment for this project using the below commands

`python -m venv pdmdata` or `python3 -m venv pdmdata`

`source pdmdata/bin/activate`

Pip is a package manager. If at any point you get a `ModuleNotFoundError`, you can use pip to install the missing package. I have listed the package requirements for this project in the 'requirements.txt' file, so we can use pip to install them all at once.

`pip3 install -r requirements.txt`
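
After installing, you can check from Python that the key packages are visible to your environment. This is a minimal sketch; the package list is an assumption based on the imports used later in this notebook, not the actual contents of `requirements.txt`.

```python
import importlib.util

# Assumed package list, taken from the imports used later in this notebook
packages = ["pandas", "numpy", "matplotlib", "sklearn", "seaborn"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in packages if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", missing)
else:
    print("All packages found")
```

Note that the `sklearn` module is installed with pip under the name `scikit-learn`.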


Now you're ready to start coding!


#### 2.2: Basics of Python

First we'll talk about variables, variable types, and how python interprets and stores data.

In [1]:
# these are a bunch of package imports, the great thing about coding in 2025 is the grunt work is 
# almost always done for you so you can just import packages that do tasks for you

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
import seaborn as sns
import util

In [2]:
# Integer
x = 10  
print(x)

10


In [3]:
# Float
y = 10.5  
print(y)  # type(y) is <class 'float'>

10.5


In [4]:
# String
name = "Leponda"
print(name)  # type(name) is <class 'str'>

Leponda


In [5]:
# Boolean
is_student = True
print(is_student)  # type(is_student) is <class 'bool'>

True


In [6]:
pledges = ["gurnoor", "arjun", "sarah", "katie", "sadie", "aathma", "jay"]
pledge_points = [4, 4, -1, 8, 16, 4, 4] # as of 03/17 at 7:21 PM

print(pledges)  # type(pledges) is <class 'list'>
print(pledge_points)  # type(pledge_points) is <class 'list'>

['gurnoor', 'arjun', 'sarah', 'katie', 'sadie', 'aathma', 'jay']
[4, 4, -1, 8, 16, 4, 4]
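
To check how Python classifies a value, you can call `type()` on it. A small sketch using stand-in values like the ones above:

```python
values = [10, 10.5, "Leponda", True, ["gurnoor", "arjun"]]

for value in values:
    # type() returns the class Python uses to store the value
    print(repr(value), "->", type(value).__name__)

# isinstance() asks whether a value belongs to a given type
print(isinstance(10.5, float))
```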


There are a few manipulations you can do with lists that are pretty useful

Guess what this will do

In [7]:
[1, 2, 3] + [4, 5, 6]

[1, 2, 3, 4, 5, 6]

Now we're using NumPy arrays. They look similar to lists, but guess what this will return:

In [8]:
np.array([1, 2, 3]) + np.array([4, 5, 6])

array([5, 7, 9])

In [9]:
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

b = np.array([[2, 4, 6],
              [8, 10, 12],
              [14, 16, 18]])
a + b

array([[ 3,  6,  9],
       [12, 15, 18],
       [21, 24, 27]])
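
The same contrast shows up with multiplication: a list repeats itself, while a NumPy array multiplies each element. A short sketch:

```python
import numpy as np

numbers = [1, 2, 3]

repeated = numbers * 2           # list: repeats the items
doubled = np.array(numbers) * 2  # array: multiplies every element

print(repeated)
print(doubled)
```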

Now try this out yourself! Create a 3 x 3 matrix like this:

$$ \begin{bmatrix}
1 & 2 & 3\\
4 & 5 & 6\\
7 & 8 & 9\\
\end{bmatrix} $$

Then, output a matrix where each column's average has been subtracted from every entry in that column.  
_Hint: use np.mean(arr, axis=0)_

In [10]:
#TODO: define np.array with the above items
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

In [14]:
#TODO: get means of each column and subtract from each column in matrix
means = np.mean(arr, axis=0)
output = arr - means

In [15]:
#TODO: print the output
output

array([[-3., -3., -3.],
       [ 0.,  0.,  0.],
       [ 3.,  3.,  3.]])
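
For comparison, `axis=1` takes the mean across each row instead of down each column. Below is a sketch of centering rows, where `keepdims=True` keeps the means as a 3 x 1 column so the subtraction lines up row by row:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

col_means = np.mean(arr, axis=0)                 # one mean per column, shape (3,)
row_means = np.mean(arr, axis=1, keepdims=True)  # one mean per row, shape (3, 1)

print(col_means)
print(arr - row_means)  # each row now has mean 0
```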

In [None]:
# Dictionary (key-value pairs)
trash_pledge_leaderboard = {"top_pledge": "rahil", "bottom_pledge": "shivani"}
print(trash_pledge_leaderboard)  # type(trash_pledge_leaderboard) is <class 'dict'>

This doesn't look right. Let's use the lists we created to build a dictionary that maps each pledge to their correct pledge points.

In [None]:
pledge_to_points = zip(pledges, pledge_points)
pledge_leaderboard = dict(pledge_to_points)

print(pledge_leaderboard)

Let's use some python syntax to return the top and bottom pledge. We'll start with a brief overview of for loops and if statements in python.

In [None]:
for pledge in pledges:
    print(pledge)

In [None]:
for i in range(len(pledges)):
    print(pledges[i])

In [None]:
mean_points = np.mean(pledge_points)  # compute the average once, outside the loop
for pledge, points in pledge_leaderboard.items():
    if points > mean_points:
        print(pledge, "is doing above average")
        

In [None]:
least_points = float('inf')
most_points = float('-inf')

bottom_pledge = ""
top_pledge = ""

for pledge, points in pledge_leaderboard.items():
    if points < least_points:
        least_points = points
        bottom_pledge = pledge
        
    if points > most_points:
        most_points = points
        top_pledge = pledge

In [None]:
print("top pledge is", top_pledge, "with", pledge_leaderboard[top_pledge], "points")
print("bottom pledge is", bottom_pledge, "with", pledge_leaderboard[bottom_pledge], "points")
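
The same result can be reached more compactly with the built-in `max()` and `min()`, using the dictionary's `.get` method as the comparison key. This sketch rebuilds the leaderboard from the lists above so that it runs on its own:

```python
pledges = ["gurnoor", "arjun", "sarah", "katie", "sadie", "aathma", "jay"]
pledge_points = [4, 4, -1, 8, 16, 4, 4]
pledge_leaderboard = dict(zip(pledges, pledge_points))

# key=pledge_leaderboard.get compares pledges by their points
top_pledge = max(pledge_leaderboard, key=pledge_leaderboard.get)
bottom_pledge = min(pledge_leaderboard, key=pledge_leaderboard.get)

print("top pledge is", top_pledge, "with", pledge_leaderboard[top_pledge], "points")
print("bottom pledge is", bottom_pledge, "with", pledge_leaderboard[bottom_pledge], "points")
```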

The last piece of syntax we'll go over is indexing and slicing

In [None]:
print(pledges)

In [None]:
# direct indexing
pledges[2]

In [None]:
pledges[-1]

You can also use slicing:

`list[start:end:step]`

- If you leave `start` blank, it defaults to 0.
- The item at index `end` itself is not included. If you leave `end` blank, the slice runs through the last item in the list.
- If you leave `step` blank, it defaults to 1.

In [None]:
pledges[1:6:2]
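
A few more slices on the same list, as a sketch of how the defaults behave:

```python
pledges = ["gurnoor", "arjun", "sarah", "katie", "sadie", "aathma", "jay"]

print(pledges[1:6:2])  # indices 1, 3, 5
print(pledges[:3])     # first three items
print(pledges[-2:])    # last two items
print(pledges[::-1])   # whole list, reversed
```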

#### 2.3: Using Pandas for Exploratory Data Analysis
To practice, we will use a data set about California housing that comes from sklearn. Pandas lets us read this information in as a 'dataframe'.

In [None]:
df = pd.read_csv("train.csv")

Use `.head()` to get the first 5 rows of your dataframe, or pass a number to choose how many rows to show (`df.head(2)` shows the first 2)

In [None]:
df.head(2)

A few operations on the dataframe you can use to extract information

In [None]:
# df.head()  # Show first 5 rows
# df.tail(3)  # Show last 3 rows
# df.shape  # Get number of rows and columns
# df.columns  # List column names
# df["MedInc"].value_counts()  # Count how often each distinct value occurs in a column
df.describe()  # Summary statistics for numerical columns


You can reference specific column names using brackets

In [None]:
df["MedInc"]  # Select a single column (returns a Series)
df[["MedInc", "HouseAge"]]  # Select multiple columns

There are two ways in pandas to filter a dataframe down to the rows that meet a condition: `.query()` and bracket notation

In [None]:
df.query("MedInc > 5.6431 and HouseAge > 20")

In [None]:
df[(df["MedInc"] > 5) & (df["HouseAge"] > 20)]

In [None]:
df[(df["MedInc"] > 5) | (df["HouseAge"] > 20)]
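
Both filtering styles can be tried on a small made-up dataframe. In this sketch the values are invented for illustration and only the column names match the housing data. The parentheses around each condition matter because `&` and `|` bind more tightly than `>`.

```python
import pandas as pd

# Invented values; only the column names mirror the housing data
toy = pd.DataFrame({
    "MedInc": [2.5, 6.0, 7.1, 4.0],
    "HouseAge": [10, 30, 15, 40],
})

both = toy[(toy["MedInc"] > 5) & (toy["HouseAge"] > 20)]    # rows meeting both conditions
either = toy[(toy["MedInc"] > 5) | (toy["HouseAge"] > 20)]  # rows meeting at least one

# The same "both" filter written with .query()
same_as_both = toy.query("MedInc > 5 and HouseAge > 20")

print(both)
print(either)
```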

We can also get specific rows and columns using `.iloc[]` (selects by position) and `.loc[]` (selects by label)

In [None]:
df.iloc[0]  # Select first row (by position)
df.iloc[:3]  # Select first three rows

df.loc[0, "MedInc"]  # Select a specific value (row label 0, column "MedInc")
df.loc[:, "HouseAge"]  # Select all rows for the "HouseAge" column
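
The difference between the two only becomes visible when row labels stop matching row positions, for example after sorting. A sketch on an invented dataframe:

```python
import pandas as pd

toy = pd.DataFrame({"HouseAge": [30, 10, 20]})  # invented values
by_age = toy.sort_values(by="HouseAge")          # row labels are now ordered 1, 2, 0

first_by_position = by_age.iloc[0]["HouseAge"]  # first row as displayed
row_with_label_0 = by_age.loc[0, "HouseAge"]    # row whose index label is 0

print(first_by_position)
print(row_with_label_0)
```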

The next pandas operation we will cover is sorting, which you can do in ascending or descending order and also across multiple columns

args:

`by` decides which columns to sort by

`ascending` dictates the order of sorting (increasing, decreasing)

In [None]:
df.sort_values(by="HouseAge")  # Sort by HouseAge (ascending)
df.sort_values(by="HouseAge", ascending=False)  # Sort by HouseAge (descending)
df.sort_values(by=["HouseAge", "MedInc"], ascending=[True, False])  # Sort by multiple columns
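
One detail worth knowing is that `sort_values` returns a new sorted dataframe and leaves the original untouched unless you assign the result back. A sketch with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    "HouseAge": [20, 10, 20],
    "MedInc": [3.0, 5.0, 8.0],
})

# Sort by HouseAge ascending; ties are broken by MedInc, highest first
sorted_toy = toy.sort_values(by=["HouseAge", "MedInc"], ascending=[True, False])

print(sorted_toy)
print(toy)  # original order is unchanged
```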

Now that you have an understanding of how to view different parts of the dataframe, you can create your own columns using the below commands

In [None]:
df["MedIncNorm"] = (df["MedInc"] - df["MedInc"].min()) / (df["MedInc"].max() - df["MedInc"].min())  # Add a new column with MedInc normalized between 0 and 1

If you decide that adding that column was a bad idea, you can `.drop()` the column

`axis` indicates whether the labels you pass refer to rows or columns (`axis=0` drops rows by index label and `axis=1` drops columns by name)

`inplace` dictates whether you mutate the original dataframe (`inplace=True`) or get back a new dataframe with the operation applied (`inplace=False`, the default)
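
To see how `axis` and `inplace` interact, here is a sketch on an invented dataframe with a throwaway column named `Scratch` (a hypothetical name, not a column in the housing data):

```python
import pandas as pd

toy = pd.DataFrame({"MedInc": [2.5, 6.0], "Scratch": [0, 1]})

# Without inplace, drop returns a new dataframe and toy keeps its columns
without_scratch = toy.drop("Scratch", axis=1)
print(list(without_scratch.columns))
print(list(toy.columns))

# With inplace=True, toy itself is modified and drop returns None
toy.drop("Scratch", axis=1, inplace=True)
print(list(toy.columns))
```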

In [None]:
#TODO: Drop MedIncNorm column from dataframe

Finally, we'll explore aggregation & grouping across columns

In [None]:
df["MedInc"].mean()  # Average of the MedInc column
df["MedInc"].min()  # Minimum of the MedInc column

We can also group together rows that share the same HouseAge and take a mean within each group. For example, for a HouseAge of 1.0, we get a 'Population' mean of 328.500000

In [None]:
df.groupby("HouseAge")["Population"].mean().head(5)  # Average population per HouseAge value
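
`groupby` can also compute several statistics per group at once through `.agg()`. A sketch with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    "HouseAge": [1.0, 1.0, 2.0, 2.0, 2.0],
    "Population": [300, 400, 100, 200, 300],
})

# One row per HouseAge value, one column per statistic
summary = toy.groupby("HouseAge")["Population"].agg(["mean", "count"])
print(summary)
```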

Readability of your code and dataframe is very important. Some of these column names are not super readable, so let's `.rename()` them

`columns` takes key-value pairs that map each old column name to its new column name

`inplace` same as above

In [None]:
df.rename(columns={"MedInc": "MedianIncome"}, inplace=True)
df

We can also improve readability of code using comment lines to document commands. Highlight any block of code and hit `Cmd + /` to comment all of it. Without highlighting anything, `Cmd + /` will comment the current line you're on. Generally adding a `#` before any line will make it commented.

Use any of the above commenting methods to remove the error from below

In [None]:
# below I am printing voyager seniors

print("shivani")
print("christine")
print("eric")
print("nolan")
print("vaarun")

#### 2.4: Visualization
Presenting your findings and visualizing trends is an essential data science skill. We will use matplotlib.pyplot, a widely used plotting library that gives you a lot of freedom in how you build a visualization.

In [None]:
df

In [None]:
# Honestly i think i am better suited doing this for you guys so you can see how plotting works
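
As a starting point, a basic scatter plot could look like the sketch below. The data points are invented so the cell runs on its own; with the housing dataframe you would pass two of its columns instead.

```python
import matplotlib.pyplot as plt

# Invented values standing in for two dataframe columns
median_income = [1.5, 2.8, 3.9, 5.2, 6.7, 8.1]
price = [0.9, 1.4, 2.1, 2.6, 3.5, 4.2]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(median_income, price, alpha=0.7)  # one dot per (income, price) pair
ax.set_xlabel("Median income")
ax.set_ylabel("Price")
ax.set_title("Price vs. median income (invented data)")
plt.show()
```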

### Section 3: Basics of Linear Modeling
#### 3.1: Building a Linear Model

Your objective is to predict a value from other features by constructing a linear relationship between the features and the predicted value. I've provided you guys with a training set and a testing set, and we'll talk about how you can use them.

In [None]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [None]:
X_train = train_df.drop(columns=["PRICE", "Id"])  # Drop the target column and the Id column, leaving only features
y_train = train_df["PRICE"]  # Target variable

In [None]:
X_test = test_df

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)

#### 3.2 Model Evaluation
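For our purposes, we'll be evaluating on r-squared, the formula for which is provided below. This metric describes the proportion of the variance in the target that your model accounts for. In simple terms, it measures how well our linear model fits the data: closer to 1 generally means more accurate and closer to 0 means less accurate (although if it is exactly 1.0, your model is probably overfitting). On the training data of a linear regression with an intercept it falls between 0 and 1, but on new data it can go negative when the model predicts worse than simply guessing the mean. In the formula, $y_i$ is the actual value, $\hat{y}_i$ is the model's prediction, and $\bar{y}$ is the mean of the actual values.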

$$ 
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$
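
The formula can be checked by hand against `r2_score`. In this sketch the actual and predicted values are invented:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # invented actual values
y_pred = np.array([2.5, 5.5, 7.0, 8.0])  # invented predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # numerator: squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # denominator: squared distance from the mean

manual_r2 = 1 - ss_res / ss_tot
print(manual_r2, r2_score(y_true, y_pred))
```

Both numbers should agree, since `r2_score` computes the same ratio.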

In [None]:
print(f"R-squared: {r2_score(y_train, y_train_pred):.4f}")

We can see that the linear regression model only reaches an r-squared of 0.61 on the training data, so there is plenty of room to improve.

To evaluate your model on the test set, you can use the helper method below to generate a submission file

In [None]:
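def generate_submission_file(model, test_features, name, submission_number):
    """Predict on the test set and write the predictions to '<name>_<submission_number>.csv'.

    Relies on the notebook-level X_train to select the test columns in the
    same order the model was trained on.
    """
    ids = test_features["Id"].astype(str)  # keep the ids for the submission file
    test_features = test_features.drop(columns=["Id"])  # drop 'Id' for prediction

    X_test = test_features[X_train.columns]  # same columns, same order as training
    y_pred = model.predict(X_test)

    submission = pd.DataFrame({
        "Id": ids,
        "Predicted": y_pred
    })

    filename = f"{name}_{submission_number}.csv"
    submission.to_csv(filename, index=False)

    return filename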


In [None]:
test_features = pd.read_csv("test.csv")
generate_submission_file(model=model, test_features=test_features, name="Rahil", submission_number=1) # update your submission number


### Submissions & Deliverables

Now that you have your submission csv, which should appear in your file directory after you run the above cell, see how your results compare with the rest of the PDM Kaggle competition:

1. Go to the below link:

https://www.kaggle.com/competitions/pdm-linear-modeling-comp/code

2. Click 'Submit Predictions'

3. Then click 'Upload Submission'

4. Click 'Submit'

5. Check the 'Leaderboard' and see how your results compare!

6. Submit as many times as you want. You should be able to achieve > 0.90 r2 on this dataset (try using random forest regression instead of linear regression, hyperparameter tuning, or feature engineering)
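
Before submitting, it can help to estimate performance locally by holding out part of the training data with `train_test_split`. The sketch below uses synthetic data (not the competition data) with a deliberately nonlinear target, and settings such as `n_estimators=100` are arbitrary starting points rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a nonlinear relationship between features and target
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.5, size=500)

# Hold out 20% of the rows to score models on data they did not train on
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

linear_r2 = r2_score(y_val, linear.predict(X_val))
forest_r2 = r2_score(y_val, forest.predict(X_val))
print(f"Linear regression validation R-squared: {linear_r2:.4f}")
print(f"Random forest validation R-squared: {forest_r2:.4f}")
```

On the real data you would split `X_train` and `y_train` the same way. How much a random forest helps there depends on the dataset, so treat the gap seen on this synthetic example as illustrative only.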

In [None]:
#TODO: Build a linear model and submit your csv on kaggle 