# **Introduction to Pandas**

Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is based on the dataframe concept found in the R programming language. For this class, Pandas will be the primary means by which we manipulate data to be processed by neural networks.
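
As a quick illustration of the data structures mentioned above, the sketch below builds a small DataFrame by hand. The column names and values are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data, not from auto-mpg.csv
cars = pd.DataFrame({
    "name": ["car a", "car b", "car c"],
    "mpg": [18.0, 25.5, 31.0],
    "cylinders": [8, 6, 4],
})

# A single column is a Series; selecting a list of columns gives a DataFrame
mpg_series = cars["mpg"]
subset = cars[["name", "mpg"]]

# Boolean masks filter rows
efficient = cars[cars["mpg"] > 20]

print(type(mpg_series).__name__)
print(subset.shape)
print(efficient["name"].tolist())
```

The same three operations (column selection, column subsetting, and row filtering) appear throughout the rest of this notebook.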

In [171]:
# Detect whether the notebook is running in Google Colab.
# The %tensorflow_version magic is kept for older runtimes; current Colab
# runtimes include only TensorFlow 2.x, so the magic has no effect there.

try:
    from google.colab import drive
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
Note: using Google CoLab


# Connecting to Google Drive
Sign in using your Google account.

In [172]:
import csv
import sys
import os
import glob
import itertools
import numpy as np
import math
import pandas as pd


from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).



**Note:**



*   Download the dataset (auto-mpg.csv) from the Google Drive folder linked below.
*   Create a folder called `Datasets` in your Google Drive and put the file there.
*   Run the cell below.

**Link:** https://drive.google.com/drive/folders/1P5ADx0gHiJ4aoqy3BnC3quEDaTroHlUZ?usp=sharing



**Loading Data from Google Drive**

In [173]:
#file_path = 'gdrive/My Drive/Datasets/auto-mpg.csv'
file_path = 'https://data.heatonresearch.com/data/t81-558/auto-mpg.csv'

column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight','Acceleration', 'Model Year', 'Origin', 'Car Name']

data_frame = pd.read_csv(file_path,engine='python',skiprows =1,header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
data_frame.head(5)


Unnamed: 0,MPG,Cylinders,Displacement,...,Model Year,Origin,Car Name
0,18.0,8,307.0,...,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,...,70,1,buick skylark 320
2,18.0,8,318.0,...,70,1,plymouth satellite
3,16.0,8,304.0,...,70,1,amc rebel sst
4,17.0,8,302.0,...,70,1,ford torino


**Custom dataframe display (truncated rows and columns)**

In [174]:
pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 5)
display(data_frame)

Unnamed: 0,MPG,Cylinders,Displacement,...,Model Year,Origin,Car Name
0,18.0,8,307.0,...,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,...,70,1,buick skylark 320
...,...,...,...,...,...,...,...
396,28.0,4,120.0,...,82,1,ford ranger
397,31.0,4,119.0,...,82,1,chevy s-10


In [175]:
pd.reset_option('display')
data_frame.dtypes

Unnamed: 0,0
MPG,float64
Cylinders,int64
Displacement,float64
Horsepower,object
Weight,int64
Acceleration,float64
Model Year,int64
Origin,int64
Car Name,object
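
The `object` dtype reported for `Horsepower` suggests that the column contains at least one non-numeric entry. Later cells in this notebook pass `'?'` in `na_values`, which is consistent with that. Here is a minimal sketch of two ways to obtain a numeric column, using a small inline CSV with hypothetical rows instead of the real file:

```python
import io
import pandas as pd

# Hypothetical sample that mimics a '?' placeholder in a numeric column
raw = "mpg,horsepower\n18.0,130\n25.0,?\n31.0,82\n"

# Option 1: read as-is, then coerce non-numeric entries to NaN
as_text = pd.read_csv(io.StringIO(raw))
coerced = pd.to_numeric(as_text["horsepower"], errors="coerce")

# Option 2: declare the placeholder as missing at load time
parsed = pd.read_csv(io.StringIO(raw), na_values=["?"])

print(as_text["horsepower"].dtype)  # not numeric, because of the '?'
print(coerced.dtype)
print(parsed["horsepower"].dtype)
```

Either way the placeholder becomes `NaN`, which the missing-value tools in pandas can then detect and fill.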


**Getting Aggregate Information about our Dataset**

In [176]:
# Keep only numeric columns (this drops Horsepower and Car Name, which are object dtype here)
data_frame = data_frame.select_dtypes(include=['int', 'float'])

headers = list(data_frame.columns.values)
fields = []

for field in headers:
    fields.append({
        'name' : field,
        'mean': data_frame[field].mean(),
        'var': data_frame[field].var(),
        'sdev': data_frame[field].std()
    })

for field in fields:
    print(field)

{'name': 'MPG', 'mean': np.float64(23.514572864321607), 'var': 61.089610774274405, 'sdev': 7.815984312565782}
{'name': 'Cylinders', 'mean': np.float64(5.454773869346734), 'var': 2.8934154399199943, 'sdev': 1.7010042445332094}
{'name': 'Displacement', 'mean': np.float64(193.42587939698493), 'var': 10872.199152247364, 'sdev': 104.26983817119581}
{'name': 'Weight', 'mean': np.float64(2970.424623115578), 'var': 717140.9905256768, 'sdev': 846.8417741973271}
{'name': 'Acceleration', 'mean': np.float64(15.568090452261307), 'var': 7.604848233611381, 'sdev': 2.7576889298126757}
{'name': 'Model Year', 'mean': np.float64(76.01005025125629), 'var': 13.672442818627143, 'sdev': 3.697626646732623}
{'name': 'Origin', 'mean': np.float64(1.5728643216080402), 'var': 0.6432920268850575, 'sdev': 0.8020548777266163}


**Converting to a data frame for better display**

In [177]:
pd.set_option('display.max_columns', 0)
pd.set_option('display.max_rows', 0)
df2 = pd.DataFrame(fields)
display(df2)

Unnamed: 0,name,mean,var,sdev
0,MPG,23.514573,61.089611,7.815984
1,Cylinders,5.454774,2.893415,1.701004
2,Displacement,193.425879,10872.199152,104.269838
3,Weight,2970.424623,717140.990526,846.841774
4,Acceleration,15.56809,7.604848,2.757689
5,Model Year,76.01005,13.672443,3.697627
6,Origin,1.572864,0.643292,0.802055
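
The loop above can also be expressed with `DataFrame.agg`, which computes several statistics in one call. This is a sketch on hypothetical toy data standing in for the numeric columns of the dataset. It renames `std` to `sdev` only to match the field names used above.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of the dataset
toy = pd.DataFrame({
    "MPG": [18.0, 15.0, 28.0, 31.0],
    "Cylinders": [8, 8, 4, 4],
})

# agg() returns one row per statistic; transpose to get one row per column
summary = toy.agg(["mean", "var", "std"]).T
summary = summary.rename(columns={"std": "sdev"})
print(summary)
```

Like `Series.var()` and `Series.std()`, the aggregated values are sample statistics (pandas divides by n - 1 by default).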


# **Missing Values**

In [178]:
import os
import pandas as pd

data_frame = pd.read_csv(file_path,engine='python',skiprows =1,header=None, index_col=False,na_values=['NA', '?'])
data_frame.columns = column_names
df = data_frame

In [179]:
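# Take an explicit copy so the assignment below modifies an independent
# frame rather than a view of data_frame (avoids SettingWithCopyWarning).
df = df.select_dtypes(include=['int', 'float']).copy()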
print(f"horsepower has na? {pd.isnull(df['Horsepower']).values.any()}")
print("Filling missing values...")
med = df['Horsepower'].median()
df['Horsepower'] = df['Horsepower'].fillna(med)
# df = df.dropna() # you can also simply drop NA values

print(f"horsepower has na? {pd.isnull(df['Horsepower']).values.any()}")

horsepower has na? True
Filling missing values...
horsepower has na? False
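
The cell above fills a single column. When several numeric columns have gaps, the same idea can be applied to all of them at once. This is a minimal sketch on hypothetical toy data, assuming that a per-column median is an acceptable fill value for every numeric column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps
toy = pd.DataFrame({
    "Horsepower": [130.0, np.nan, 82.0, 95.0],
    "Weight": [3504.0, 2625.0, np.nan, 2720.0],
    "Car Name": ["a", "b", "c", "d"],
})

# Count missing values per column before filling
print(toy.isna().sum())

# Fill each numeric column with its own median; non-numeric columns are untouched
numeric_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

print(filled.isna().sum().sum())
```

Working on a copy keeps the original frame available for comparison.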


# **Dropping Fields**

In [180]:
df = data_frame
print(df.columns)

# Drop the name column
df.drop('Car Name', axis=1, inplace=True)

cols = ['Acceleration','Displacement']

# Drop multiple columns at once
df.drop(cols, axis=1, inplace=True)
print(df.columns)


Index(['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
       'Acceleration', 'Model Year', 'Origin', 'Car Name'],
      dtype='object')
Index(['MPG', 'Cylinders', 'Horsepower', 'Weight', 'Model Year', 'Origin'], dtype='object')
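
Because `df = data_frame` only creates a second name for the same object, the in-place drops above also change `data_frame`. This sketch on a hypothetical toy frame shows the aliasing and a non-mutating alternative:

```python
import pandas as pd

# Hypothetical toy frame
original = pd.DataFrame({
    "MPG": [18.0, 15.0],
    "Acceleration": [12.0, 11.5],
    "Car Name": ["a", "b"],
})

# Plain assignment creates a second name for the same object, not a copy
alias = original
print(alias is original)

# drop() without inplace returns a new frame and leaves the source intact
trimmed = original.drop(columns=["Acceleration", "Car Name"])

print(list(trimmed.columns))
print(list(original.columns))
```

The `columns=` keyword is equivalent to passing the labels with `axis=1`.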


# **Dealing with Outliers**

In [181]:
# Remove, in place, all rows whose value in the specified column lies
# sd or more standard deviations away from the column mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean())
                          >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)

In [182]:
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

data_frame = pd.read_csv(file_path,engine='python',skiprows =1,
                         header=None, index_col=False,na_values=['NA', '?'])
data_frame.columns = column_names
df = data_frame

# Drop the name column
df.drop('Car Name', axis=1, inplace=True)

# Fill missing horsepower values with the column median
med = df['Horsepower'].median()
df['Horsepower'] = df['Horsepower'].fillna(med)
df = df.select_dtypes(include=['int', 'float'])
# Drop rows whose MPG is 2 or more standard deviations from the mean
print("Length before MPG outliers dropped: {}".format(len(df)))
remove_outliers(df,'MPG',2)
print("Length after MPG outliers dropped: {}".format(len(df)))

pd.set_option('display.max_columns', 0)
pd.set_option('display.max_rows', 5)
display(df)

Length before MPG outliers dropped: 398
Length after MPG outliers dropped: 388


Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
0,18.0,8,307.0,130.0,3504,12.0,70,1
1,15.0,8,350.0,165.0,3693,11.5,70,1
...,...,...,...,...,...,...,...,...
396,28.0,4,120.0,79.0,2625,18.6,82,1
397,31.0,4,119.0,82.0,2720,19.4,82,1
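
`remove_outliers` modifies the frame passed to it. A variant that returns a filtered copy can be easier to reason about. The function name `filter_outliers` and the toy data below are hypothetical, but the threshold rule is the same as in the cell above:

```python
import pandas as pd

def filter_outliers(df, name, sd):
    """Return a copy of df without rows whose `name` value lies
    sd or more standard deviations from the column mean."""
    z = (df[name] - df[name].mean()) / df[name].std()
    return df[z.abs() < sd].copy()

# Hypothetical toy data with one extreme value
toy = pd.DataFrame(
    {"MPG": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 90.0]}
)
kept = filter_outliers(toy, "MPG", 2)

print(len(toy), len(kept))
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so this rule is a rough screen rather than a robust outlier test.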


In [183]:
import os
import pandas as pd

data_frame = pd.read_csv(file_path,engine='python',
                         skiprows =1,header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
df = data_frame

print(f"Before drop: {list(df.columns)}")
df.drop('Origin', axis = 1, inplace=True)
print(f"After drop: {list(df.columns)}")

Before drop: ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin', 'Car Name']
After drop: ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Car Name']


# **Concatenating Rows and Columns**

In [184]:
# Create a new dataframe from name and horsepower

import os
import pandas as pd

data_frame = pd.read_csv(file_path,engine='python',skiprows =1,header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
df = data_frame


col_horsepower = df['Horsepower']
col_name = df['Car Name']
result = pd.concat([col_name, col_horsepower], axis=1)

pd.set_option('display.max_columns', 0)
pd.set_option('display.max_rows', 5)
display(result)

Unnamed: 0,Car Name,Horsepower
0,chevrolet chevelle malibu,130
1,buick skylark 320,165
...,...,...
396,ford ranger,79
397,chevy s-10,82


In [185]:
# Create a new dataframe from first 2 rows and last 2 rows

import os
import pandas as pd

data_frame = pd.read_csv(file_path,engine='python',
                         skiprows =1,header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
df = data_frame

result = pd.concat([df[0:2],df[-2:]], axis=0)

pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 0)
display(result)

Unnamed: 0,MPG,Cylinders,Displacement,...,Model Year,Origin,Car Name
0,18.0,8,307.0,...,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,...,70,1,buick skylark 320
396,28.0,4,120.0,...,82,1,ford ranger
397,31.0,4,119.0,...,82,1,chevy s-10


# **Training and Validation**

In [186]:
import os
import pandas as pd
import numpy as np

data_frame = pd.read_csv(file_path,engine='python',
                         skiprows =1,header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
df = data_frame

# Usually a good idea to shuffle
df = df.reindex(np.random.permutation(df.index))

mask = np.random.rand(len(df)) < 0.8
trainDF = pd.DataFrame(df[mask])
validationDF = pd.DataFrame(df[~mask])

print(f"Training DF: {len(trainDF)}")
print(f"Validation DF: {len(validationDF)}")

Training DF: 309
Validation DF: 89
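
The random mask above gives a split that is only approximately 80/20 (309 and 89 rows in the output shown) and changes on every run. If an exact, repeatable split is preferred, `DataFrame.sample` with a fixed `random_state` is one option. This is a sketch on a hypothetical 10-row frame, and the seed value is arbitrary:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows
toy = pd.DataFrame({"x": np.arange(10)})

# sample() shuffles and selects in one step; random_state makes it repeatable
train = toy.sample(frac=0.8, random_state=42)

# Everything not chosen for training becomes the validation set
validation = toy.drop(train.index)

print(len(train), len(validation))
```

Dropping by index works here because the index labels are unique.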


# **Converting a Dataframe to a Matrix**

---



In [187]:
df.values

array([[23.0, 4, 151.0, ..., 82, 1, 'amc concord dl'],
       [19.2, 8, 267.0, ..., 79, 1, 'chevrolet malibu classic (sw)'],
       [27.0, 4, 151.0, ..., 82, 1, 'pontiac phoenix'],
       ...,
       [39.0, 4, 86.0, ..., 81, 1, 'plymouth champ'],
       [16.0, 8, 400.0, ..., 73, 1, 'pontiac grand prix'],
       [31.8, 4, 85.0, ..., 79, 3, 'datsun 210']], dtype=object)

In [188]:
df[['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin', 'Car Name']].values

array([[23.0, 4, 151.0, ..., 82, 1, 'amc concord dl'],
       [19.2, 8, 267.0, ..., 79, 1, 'chevrolet malibu classic (sw)'],
       [27.0, 4, 151.0, ..., 82, 1, 'pontiac phoenix'],
       ...,
       [39.0, 4, 86.0, ..., 81, 1, 'plymouth champ'],
       [16.0, 8, 400.0, ..., 73, 1, 'pontiac grand prix'],
       [31.8, 4, 85.0, ..., 79, 3, 'datsun 210']], dtype=object)
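
Both arrays above have `dtype=object` because the text column `Car Name` is included, and numeric libraries generally expect a numeric dtype. This sketch on a hypothetical toy frame shows the difference when only numeric columns are selected. It uses `to_numpy()`, the method the pandas documentation recommends over `.values`:

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and text columns
toy = pd.DataFrame({
    "MPG": [18.0, 15.0],
    "Cylinders": [8, 8],
    "Car Name": ["a", "b"],
})

# Mixed column types fall back to a generic object array
mixed = toy.to_numpy()

# Numeric-only columns are upcast to a common numeric dtype
numeric = toy[["MPG", "Cylinders"]].to_numpy()

print(mixed.dtype, numeric.dtype)
```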

# **Saving a Dataframe to CSV**

In [202]:
import os
import pandas as pd
import numpy as np

data_frame = pd.read_csv(file_path,engine='python',skiprows =1,
                         header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
df = data_frame
path = 'gdrive/My Drive/Spring 2026/CPSC 360/SeanDatasets' # The folder must exist in your Google Drive

filename_write = os.path.join(path, "auto-mpg-shuffle.csv")
df = df.reindex(np.random.permutation(df.index))
# Pass index=False so the row index is not written to the file
df.to_csv(filename_write, index=False)
print('Filename and Location: ',filename_write)
print("Done")


Filename and Location:  gdrive/My Drive/Spring 2026/CPSC 360/SeanDatasets/auto-mpg-shuffle.csv
Done
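
One consequence of `index=False` is that the shuffled index labels are not stored, so a frame read back from the file gets a fresh 0-based index. This sketch uses an in-memory buffer in place of a Google Drive path, with hypothetical values:

```python
import io
import pandas as pd

# Hypothetical toy frame with a shuffled index
toy = pd.DataFrame({"MPG": [23.0, 18.0]}, index=[170, 37])

# Write to an in-memory text buffer instead of a file on disk
buf = io.StringIO()
toy.to_csv(buf, index=False)
print(buf.getvalue())

# Read the CSV text back; the original labels 170 and 37 are gone
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.index))
```

The row order is preserved, and only the labels are lost.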


# **Saving a Dataframe to Pickle**

In [190]:
import os
import pandas as pd
import numpy as np
import pickle

path = "."

data_frame = pd.read_csv(file_path,engine='python',skiprows =1,
                         header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
df = data_frame

filename_write = os.path.join(path, "auto-mpg-shuffle.pkl")
df = df.reindex(np.random.permutation(df.index))

with open(filename_write,"wb") as fp:
    pickle.dump(df, fp)

filename_read = os.path.join(path, "auto-mpg-shuffle.pkl")

# Read the frame back from the same file
with open(filename_read,"rb") as fp:
    df = pickle.load(fp)

pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 5)
display(df)

Unnamed: 0,MPG,Cylinders,Displacement,...,Model Year,Origin,Car Name
170,23.0,4,140.0,...,75,1,pontiac astro
37,18.0,6,232.0,...,71,1,amc matador
...,...,...,...,...,...,...,...
368,27.0,4,112.0,...,82,1,chevrolet cavalier wagon
173,24.0,4,119.0,...,75,3,datsun 710
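
Unlike the CSV written with `index=False`, the pickle keeps the shuffled index, as the output above shows. pandas also provides `to_pickle` and `read_pickle`, which wrap the same steps. This sketch uses a hypothetical toy frame and writes to a temporary directory instead of the notebook's working folder:

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy frame with a shuffled index
toy = pd.DataFrame({"MPG": [23.0, 18.0]}, index=[170, 37])

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(pkl_path)              # write the frame
    restored = pd.read_pickle(pkl_path)  # read it back

print(restored.equals(toy))
```

Pickle files can execute code when loaded, so only open ones from sources you trust.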


**Checking Data Types**

In [191]:
data_frame.dtypes

Unnamed: 0,0
MPG,float64
Cylinders,int64
...,...
Origin,int64
Car Name,object


In [192]:
feature_list = df.columns.tolist()
for feature in feature_list:
  print("Column Name :",feature)
  print("Data Type :", df[feature].dtype)

Column Name : MPG
Data Type : float64
Column Name : Cylinders
Data Type : int64
Column Name : Displacement
Data Type : float64
Column Name : Horsepower
Data Type : object
Column Name : Weight
Data Type : int64
Column Name : Acceleration
Data Type : float64
Column Name : Model Year
Data Type : int64
Column Name : Origin
Data Type : int64
Column Name : Car Name
Data Type : object


# **Analyzing and Visualizing data**

**Exploratory Data Analysis**

In [193]:
data_frame = pd.read_csv(file_path,engine='python',skiprows =1,
                         header=None,na_values=['-1'], index_col=False)
data_frame.columns = column_names
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MPG           398 non-null    float64
 1   Cylinders     398 non-null    int64  
 2   Displacement  398 non-null    float64
 3   Horsepower    398 non-null    object 
 4   Weight        398 non-null    int64  
 5   Acceleration  398 non-null    float64
 6   Model Year    398 non-null    int64  
 7   Origin        398 non-null    int64  
 8   Car Name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


**Finding the dimensions of the data frame**

In [194]:
# Getting the number of instances (rows) and features (columns)
print(df.shape)

# Getting the number of axes (2 for a data frame: rows and columns)
print(df.ndim)

(398, 9)
2


In [195]:
df.head(10)

Unnamed: 0,MPG,Cylinders,Displacement,...,Model Year,Origin,Car Name
170,23.0,4,140.0,...,75,1,pontiac astro
37,18.0,6,232.0,...,71,1,amc matador
...,...,...,...,...,...,...,...
211,16.5,6,168.0,...,76,2,mercedes-benz 280s
289,16.9,8,350.0,...,79,1,buick estate wagon (sw)


In [196]:
print('All done')

All done


---


# **Practice Exercises (No Solutions)**

These are short practice tasks to help you get comfortable with **pandas DataFrames**. Try to complete each code cell.

## Practice 1 — Load & inspect

Load the dataset into a DataFrame named `df_practice`, then:
- print the shape
- display the first 5 rows
- show summary statistics for numeric columns

In [197]:
# TODO: Load the CSV into df_practice
# - Use the same URL used earlier in the notebook (auto-mpg.csv)
# - Tip: pd.read_csv(...)

import pandas as pd

file_path = "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv"

# YOUR CODE HERE
# df_practice = ...

# TODO: Inspect
# print(df_practice.shape)
# display(df_practice.head())
# display(df_practice.describe())


## Practice 2 — Missing values

Using `df_practice`:
1. Show the number of missing values per column.
2. Drop rows where `horsepower` is missing.
3. Fill remaining numeric missing values with the **median** of each column.

Store the cleaned result in `df_clean`.

In [198]:
# YOUR CODE HERE
# 1) missing per column
# 2) drop missing horsepower
# 3) fill remaining numeric NaNs with median

# df_clean = ...


## Practice 3 — Feature engineering & sorting

Using `df_clean`:
- Create a new column `power_to_weight = horsepower / weight`
- Sort descending by `power_to_weight`
- Show the top 10 rows with columns: `['name','horsepower','weight','power_to_weight','mpg']`

In [199]:
# YOUR CODE HERE
# df_feat = ...
