# Welcome to Part 2 of the BIOL 1595 Python Primer!

Part 2 of this primer introduces NumPy and pandas, working with files, and scikit-learn. The goal is to show you common techniques that you will likely use in future coding assignments for BIOL 1595.

## 5. Intro to NumPy

In this section, we will go over the basics of NumPy, one of the most commonly used Python libraries. NumPy makes handling numerical data much more efficient and provides built-in functions so we don't have to write them ourselves! We won't cover everything NumPy has to offer, so be sure to check out the NumPy documentation if you ever get stuck.

One of the key aspects of NumPy is that it allows for element-wise operations. This means we can apply an operation to every value in a NumPy array (NumPy's version of a list) at once, without writing a loop.

In [None]:
import numpy as np

#Example 5.1: NumPy Arrays and Element-Wise Operations

# Create a NumPy array from a Python list
nums = np.array([1, 2, 3, 4, 5])

print(nums)
print(type(nums))

nums = nums * 2 # Element-wise operation: multiply each element by 2
print(nums)

#Without numpy, you would have to do this with a for loop:
new_nums = []
for num in [1, 2, 3, 4, 5]:
    new_nums.append(num * 2)
print(new_nums)


[1 2 3 4 5]
<class 'numpy.ndarray'>
[ 2  4  6  8 10]
[2, 4, 6, 8, 10]
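
Element-wise operations also work between two arrays of the same length, pairing up values by position. The sketch below illustrates this with made-up heights and weights; the values and variable names are assumptions chosen for illustration, not course data.

```python
import numpy as np

# Hypothetical measurements for three people (illustrative values only)
weights_kg = np.array([60.0, 72.0, 85.0])
heights_m = np.array([1.60, 1.80, 1.70])

# Each operation is applied position by position: BMI = weight / height^2
bmi = weights_kg / heights_m ** 2
print(bmi)

# Comparisons are element-wise too, and return an array of booleans
high_bmi = bmi > 25
print(high_bmi)
```

The comparison on the last lines produces one True or False per element.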


Similarly, we can do something called boolean masking with NumPy arrays. This is a type of filtering where we select or modify data in an array based on whether it satisfies some condition. We did something similar with list comprehensions in Example 2.3 in Part 1 of the primer. The advantage of boolean masking, which does not work on plain Python lists, is that it is faster and more concise.

In [None]:
#Example 5.2 Boolean Masking

# Create a NumPy array
data = np.array([10, 15, 20, 25, 30, 35, 40])
# Create a boolean mask for values greater than 25
mask = data > 25
print(mask)
print(data[mask]) # Use the mask to filter the data

[False False False False  True  True  True]
[30 35 40]
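
Masks can also be combined or inverted, and used to overwrite the selected values. A short sketch, reusing the array from Example 5.2 (the names `in_range` and `capped` are just illustrative):

```python
import numpy as np

data = np.array([10, 15, 20, 25, 30, 35, 40])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
in_range = (data >= 15) & (data <= 30)
print(data[in_range])    # values between 15 and 30, inclusive

# ~ inverts a mask, selecting everything the mask excluded
print(data[~in_range])

# A mask can also pick which elements to overwrite (done on a copy so data is unchanged)
capped = data.copy()
capped[capped > 30] = 30
print(capped)
```

Note that Python's `and` and `or` keywords do not work on arrays here; NumPy uses `&` and `|` for element-wise logic.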


In [12]:
#Example 5.3: Common numpy functions

data = np.array([72, 75, 78, 90, 85, 88, 110, 92])

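#Summary statistics
#Note: we avoid naming variables min or max, since that would overwrite Python's built-in functions
mean = np.mean(data)
median = np.median(data)
min_val = np.min(data)
max_val = np.max(data)
std = np.std(data)

print("Mean:", mean)
print("Median:", median)
print("Min:", min_val)
print("Max:", max_val)
print("Std:", std)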

sorted_data = np.sort(data)
unique_data = np.unique(data)
filtered_data = np.where(data > 80, data, 0) # Keep values greater than 80; replace all others with 0
print("Sorted data:", sorted_data)
print("Unique data:", unique_data)
print("Filtered data:", filtered_data)

Mean: 86.25
Median: 86.5
Min: 72
Max: 110
Std: 11.255554184490428
Sorted data: [ 72  75  78  85  88  90  92 110]
Unique data: [ 72  75  78  85  88  90  92 110]
Filtered data: [  0   0   0  90  85  88 110  92]
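
One detail worth knowing about `np.std`: by default it computes the population standard deviation (dividing by n). If your data is a sample from a larger population, you may want the sample standard deviation (dividing by n - 1), which NumPy provides through the `ddof` parameter. A small sketch using the same array as Example 5.3:

```python
import numpy as np

data = np.array([72, 75, 78, 90, 85, 88, 110, 92])

population_std = np.std(data)       # default ddof=0: divide by n
sample_std = np.std(data, ddof=1)   # ddof=1: divide by n - 1

print("Population std:", population_std)
print("Sample std:", sample_std)
```

The sample value is always slightly larger, and the difference shrinks as the number of data points grows.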


## 6. Working with Files and Pandas

Data is almost always stored in files of some type, so it is important to be able to extract the information we need and store it in data structures we can work with. While we can read and write files with plain Python, pandas is a library that makes this simpler by introducing the DataFrame, a new type of data structure.

In [19]:
#Example 6.1: Reading CSV files with regular Python

#Open the file for reading
with open('data/hd_data.csv', 'r') as file:
    #the readline() function only reads one line at a time and returns strings
    print(file.readline())
    print(file.readline())

with open('data/hd_data.csv', 'r') as file:
    #the readlines() function reads all lines and returns a list of strings
    lines = file.readlines()
    print(lines[0]) # Print the header
    print(lines[1]) # Print the first data row


Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease

70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence

Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease

70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence



This way of reading files isn't very helpful: the returned strings contain unnecessary characters, such as the commas and the trailing newline, and we have no way of selecting values in a specific column. We will instead use the split() and strip() functions covered in Part 1.

In [None]:
ages = []
cholesterol = []

with open('data/hd_data.csv', 'r') as file:
    #recall that strip() removes leading and trailing whitespace (including the newline),
    #and split() splits the string into a list based on the specified delimiter, in this case ","
    header = file.readline().strip().split(",")
    ages_index = header.index("Age")
    cholesterol_index = header.index("Cholesterol")

    for line in file: #another way to read all lines in a file
        values = line.strip().split(",")

        ages.append(values[ages_index])
        cholesterol.append(values[cholesterol_index])

#Print first 5 elements in each list
print("Ages:", ages[:5])  
print("Cholesterol:", cholesterol[:5])

Ages: ['70', '67', '57', '64', '74']
Cholesterol: ['322', '564', '261', '263', '269']
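
Notice the quotation marks in the output: values read this way are strings, not numbers, so they need to be converted before doing any math. A minimal sketch of two ways to do that, using the five ages printed above:

```python
import numpy as np

# First five ages as read from the file: these are strings
ages = ['70', '67', '57', '64', '74']

# Option 1: convert each string with int() in a list comprehension
ages_int = [int(age) for age in ages]
print(ages_int)

# Option 2: build a NumPy array of strings, then convert its type to int
ages_arr = np.array(ages).astype(int)
print(ages_arr.mean())
```

For a column with decimal values, such as "ST depression", `float` would be used in place of `int`.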


We will now use pandas' DataFrame object. Using the read_csv() function, pandas automatically reads the CSV file and assigns column names from the header row.

In [28]:
#Example 6.2: Reading files with Pandas
import pandas as pd
df = pd.read_csv('data/hd_data.csv')
print(df.head()) # Print the first 5 rows of the DataFrame
print("Columns: ", df.columns) # Print the column names
print("Ages:", df["Age"].head()) # Print the first 5 values in the "Age" column
print("Cholesterol:", df["Cholesterol"].head()) # Print the first 5 values in the "Cholesterol" column


   Age  Sex  Chest pain type   BP  Cholesterol  FBS over 120  EKG results  \
0   70    1                4  130          322             0            2   
1   67    0                3  115          564             0            2   
2   57    1                2  124          261             0            0   
3   64    1                4  128          263             0            0   
4   74    0                2  120          269             0            2   

   Max HR  Exercise angina  ST depression  Slope of ST  \
0     109                0            2.4            2   
1     160                0            1.6            2   
2     141                0            0.3            1   
3     105                1            0.2            2   
4     121                1            0.2            1   

   Number of vessels fluro  Thallium Heart Disease  
0                        3         3      Presence  
1                        0         7       Absence  
2                        0   
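
Once the data is in a DataFrame, the boolean masking idea from Section 5 carries over directly. The sketch below is a self-contained illustration: instead of opening `data/hd_data.csv`, it reads a small in-memory copy of four columns from the first five rows shown above (the name `small_df` and the use of `io.StringIO` are choices made for this example only).

```python
import io
import pandas as pd

# Four columns from the first five rows of the dataset, as in-memory CSV text
csv_text = """Age,BP,Cholesterol,Max HR
70,130,322,109
67,115,564,160
57,124,261,141
64,128,263,105
74,120,269,121
"""
small_df = pd.read_csv(io.StringIO(csv_text))

# Boolean masking: keep only the rows where Age is greater than 65
older = small_df[small_df["Age"] > 65]
print(older)

# Select several columns at once by passing a list of column names
print(small_df[["Age", "Cholesterol"]].describe())

# Column-wise statistics, with values already parsed as numbers
mean_chol = small_df["Cholesterol"].mean()
print("Mean cholesterol:", mean_chol)
```

Unlike the manual approach in Example 6.1, the values here are already numeric, so no string conversion is needed.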

## 7. Intro to Scikit-Learn

Scikit-learn is a popular Python library for building and evaluating machine learning models. In the examples below, we build small models for classification and regression, then calculate metrics to evaluate their performance.

In [30]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


df = pd.read_csv('data/hd_data.csv')
#Features: pick a few relevant ones
X = df[['Age', 'BP', 'Cholesterol', 'Max HR']]
y = df['Heart Disease']  # target variable, this is what we will predict

#Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#Create and train model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

#Predict on test set
predictions = clf.predict(X_test)
print("Predicted heart disease for test set:", predictions[:5])



Predicted heart disease for test set: ['Absence' 'Absence' 'Absence' 'Absence' 'Presence']
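
To see what `train_test_split` is doing, it can help to try it on a tiny made-up dataset. In this sketch, `X_demo` and `y_demo` are hypothetical arrays created only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus alternating labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.3 holds out 30% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print("Train shape:", X_tr.shape)
print("Test shape:", X_te.shape)

# Using the same random_state again reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print("Same split:", np.array_equal(X_te, X_te_again))
```

Fixing `random_state` is what makes results reproducible from one run to the next; without it, the rows are shuffled differently each time.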


In [31]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


df = pd.read_csv('data/hd_data.csv')
#Features: Age, BP, Cholesterol
X = df[['Age', 'BP', 'Cholesterol']]
y = df['Max HR']  # target variable

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict on test set
predicted = reg.predict(X_test)
print("Predicted Max HR for test set:", predicted[:5])


Predicted Max HR for test set: [146.02857731 158.45157662 147.13197406 150.61168118 144.80808981]


In [34]:
from sklearn.metrics import accuracy_score, r2_score, recall_score, precision_score
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split


df = pd.read_csv('data/hd_data.csv')
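#Converting string labels into binary labels for classification
#map() swaps each label for its number (and avoids the warning that replace() raises in newer pandas versions)
df['Heart Disease Numeric'] = df['Heart Disease'].map({'Absence': 0, 'Presence': 1})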

X_class = df[['Age', 'BP', 'Cholesterol', 'Max HR']]
y_class = df['Heart Disease Numeric']
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_class, y_class, test_size=0.3, random_state=42)

#Classification metrics: accuracy, recall, and precision
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_c, y_train_c)
y_pred_class = clf.predict(X_test_c)
accuracy = accuracy_score(y_test_c, y_pred_class)
recall = recall_score(y_test_c, y_pred_class)
precision = precision_score(y_test_c, y_pred_class)
print("Classification accuracy:", accuracy)
print("Classification recall:", recall)
print("Classification precision:", precision)


X_reg = df[['Age', 'BP', 'Cholesterol']]
y_reg = df['Max HR']
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

#Regression metric: R² score
reg = LinearRegression()
reg.fit(X_train_r, y_train_r)
y_pred_reg = reg.predict(X_test_r)
r2 = r2_score(y_test_r, y_pred_reg)
print("Regression R² score:", r2)


Classification accuracy: 0.6419753086419753
Classification recall: 0.53125
Classification precision: 0.5483870967741935
Regression R² score: 0.15768669328656026


