# Python Practice Lecture 5 MATH 342W Queens College 
## Author: Amir ElTabakh
## Date: February 10, 2022

### Agenda
- Loading datasets
- `.iloc[]` and list comprehension
- The null model
- The Threshold model
- The Perceptron model

In this practice demo we will go over how to import datasets, both from Python libraries and locally. There are a number of standard datasets used to practice one's data science skills. Many of these datasets are accessible via Python libraries.
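As a minimal sketch of the "locally" case, `pd.read_csv` reads a CSV file into a DataFrame. To keep the example self-contained, an in-memory string stands in for a file on disk; the column names and the filename mentioned in the comment are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for a local file. With a real file you would pass its path instead,
# e.g. pd.read_csv("my_data.csv") (hypothetical filename).
csv_text = io.StringIO("name,salary\nSophia,61399\nEmma,56945\nOlivia,49087\n")

local_df = pd.read_csv(csv_text)
print(local_df.shape)   # (rows, columns)
print(local_df.head())
```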

### Scikit learn (sklearn)
Scikit-learn is one of the most widely used machine learning libraries in Python. It provides a selection of efficient tools for ML and statistical modeling, including classification, regression, clustering and dimensionality reduction (which we won't get to). The library is largely written in Python and builds on the NumPy, SciPy and Matplotlib packages. It was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, researchers from the French Institute for Research in Computer Science and Automation (INRIA) took the project further and made the first public release in 2010.

Many libraries provide datasets to demo in Python. You can read up on some [here](https://towardsdatascience.com/datasets-in-python-425475a20eb1). 

- pydataset: Dataset package,
- seaborn: Data Visualisation package,
- sklearn: Machine Learning package,
- statsmodels: Statistical Model package and
- nltk: Natural Language Tool Kit package

In [1]:
# Install sklearn
!pip install -U scikit-learn





### Boston Housing Dataset sklearn
The sklearn Boston dataset is widely used in regression and is a famous dataset that was published in 1978. There are 506 instances and 14 attributes (13 predictive features plus the target), which we will explore later.

First we will load the Boston Housing dataset, lightly explore the documentation, and then load the data into a Pandas DataFrame.

In [2]:
# Import sklearn's datasets module
from sklearn import datasets

# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')

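# Load the Boston Housing dataset as bh
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only works with an older scikit-learn version
bh = datasets.load_boston()

# The Boston dataset is essentially a dictionary, let's check its keys
bh.keys()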

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename', 'data_module'])

The Boston Housing 'dictionary' contains the data `X` (key `data`), the target `y` (key `target`), the names of the `p` features (key `feature_names`), and `DESCR`, the description of the data.

In [3]:
print(bh['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [4]:
# Import Pandas
import pandas as pd

# Create Boston Housing df
df = pd.DataFrame(data = bh.data, columns = bh.feature_names)

# Load the first 5 rows of df
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [5]:
# A quick view on some of the statistical information in our dataset
df.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97


### `pd.DataFrame.iloc`

The `iloc` property gets, or sets, the value(s) of the specified indices in a DataFrame.

To get a single value specify both row and column with their corresponding indices.

`df.iloc[6, 8]`

To access more than one row, use double brackets and specify the indices, separated by commas:

`df.iloc[[0, 2]]` or  `df.iloc[[0, 2], :]`

Specify columns by including their indices in another list:

`df.iloc[[0, 2], [0, 1]]`

You can also specify a slice of the DataFrame with from and to indices, separated by a colon:

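`df.iloc[0:2]` or  `df.iloc[0:2, :]`

Note that a slice is written without the extra brackets (`df.iloc[[0:2]]` is a syntax error) and that the end index is excluded.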
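The indexing forms above can be tried on a small DataFrame. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not part of the Boston data.

```python
import pandas as pd

# Toy DataFrame (hypothetical values)
toy = pd.DataFrame({"a": [10, 20, 30, 40],
                    "b": [1.5, 2.5, 3.5, 4.5],
                    "c": ["w", "x", "y", "z"]})

scalar = toy.iloc[2, 0]         # single row and column -> a scalar
row = toy.iloc[1]               # single row -> a Series
rows = toy.iloc[[0, 2]]         # list of row indices -> a DataFrame
block = toy.iloc[0:2, [0, 1]]   # row slice (end excluded) and column list -> a DataFrame

print(scalar)
print(type(row).__name__, type(rows).__name__)
print(block.shape)
```

The return type depends on the indexer: integers reduce a dimension, while lists and slices keep it.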

In [6]:
# Get the 4th row (index 3) of df
df.iloc[3,:]

CRIM         0.03237
ZN           0.00000
INDUS        2.18000
CHAS         0.00000
NOX          0.45800
RM           6.99800
AGE         45.80000
DIS          6.06220
RAD          3.00000
TAX        222.00000
PTRATIO     18.70000
B          394.63000
LSTAT        2.94000
Name: 3, dtype: float64

The above row displayed in this form is called a Pandas Series. A 'Series' is what you call a row or column vector in Pandas.

In [7]:
type(df.iloc[3,:])

pandas.core.series.Series

In [8]:
# Get the element at row 4, column 5 (indices 3 and 4)
df.iloc[3, 4]

0.458

In [9]:
# What is the type of the value above
type(df.iloc[3, 4])

numpy.float64

In [10]:
# Get rows 2 and 7 (indices 1 and 6) for columns 5 through 7 (indices 4, 5, 6)
df.iloc[[1, 6], [4, 5, 6]]

Unnamed: 0,NOX,RM,AGE
1,0.469,6.421,78.9
6,0.524,6.012,66.6


The row values above, displayed in this form, are a Pandas DataFrame.

In [11]:
type(df.iloc[[1, 6], [4, 5, 6]])

pandas.core.frame.DataFrame

In [12]:
# Get the values at row indices 2 and 7 in column index 0
df.iloc[[2, 7], 0]

2    0.02729
7    0.14455
Name: CRIM, dtype: float64

In [13]:
# Get the last column
df.iloc[:, -1]

0      4.98
1      9.14
2      4.03
3      2.94
4      5.33
       ... 
501    9.67
502    9.08
503    5.64
504    6.48
505    7.88
Name: LSTAT, Length: 506, dtype: float64

The above column displayed in this form is called a Pandas Series. So a single row or a single column is a Series.

In [14]:
type(df.iloc[:, -1])

pandas.core.series.Series

In [15]:
# Get all rows 80 through 85, but only columns index 0 through 4
df.iloc[80:86, 0:5]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX
80,0.04113,25.0,4.86,0.0,0.426
81,0.04462,25.0,4.86,0.0,0.426
82,0.03659,25.0,4.86,0.0,0.426
83,0.03551,25.0,4.86,0.0,0.426
84,0.05059,0.0,4.49,0.0,0.449
85,0.05735,0.0,4.49,0.0,0.449


In [16]:
# Get all rows for the first 7 columns
df.iloc[:, :7]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2
...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3


Great, now you know how to crop your data. This could be useful for identifying elements and extracting/excluding data.

### List Comprehension

Let's add another skill to our growing arsenal. Python is known for easy-to-understand and efficient programming. One of the most distinctive aspects of the language is the Python list and the list comprehension feature, which we can use within a single line of code to construct powerful functionality. List comprehensions are used for creating new lists from other iterables like tuples, strings, arrays, lists, etc. A list comprehension consists of brackets containing the expression, which is executed for each element along with the for loop to iterate over each element.

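A list comprehension also typically runs somewhat faster than the equivalent for loop that appends to a list, as the timing comparison later in this section suggests.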

`newList = [ expression(element) for element in oldList if condition ]`

Let's take a look at it in action.

In [17]:
# multiply each element in this list by 2 and cast as an int using list comprehension
some_elements = [2.33, 4.56, 88.9, 34, 103.23]

new_elems = [int(x * 2) for x in some_elements] # notice int() truncates toward zero (same as the floor for positive numbers)
new_elems

[4, 9, 177, 68, 206]
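The `if condition` part of the template is not exercised by the cell above. A minimal sketch of it, reusing the same list, follows; the cutoff of 10 is an arbitrary choice for illustration. The last two lines illustrate the difference between `int()` and `math.floor()` on a negative number.

```python
import math

some_elements = [2.33, 4.56, 88.9, 34, 103.23]

# Keep only the elements greater than 10, then double and truncate them
big_doubled = [int(x * 2) for x in some_elements if x > 10]
print(big_doubled)

# int() truncates toward zero, math.floor() rounds down
print(int(-2.7), math.floor(-2.7))
```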

In [18]:
# Import required module
import time


# define function to implement for loop
def for_loop(n):
    result = []             # empty list
    for i in range(n):      # iterate n times
        result.append(i**2) # append i^2 to list
    return result           # return the now populated list


# define function to implement list comprehension
def list_comprehension(n):
    return [i**2 for i in range(n)]
 

    
# Driver Code 
 
# Calculate time taken by for_loop()
begin = time.time()
for_loop(10**7)
end = time.time()
 
# Display time taken by for_loop()
print('Time taken for_loop:',round(end-begin,2))



# Calculate time taken by list_comprehension()
begin = time.time()
list_comprehension(10**7)
end = time.time()
 
# Display time taken by list_comprehension()
print('Time taken for list_comprehension:',round(end-begin,2))

Time taken for_loop: 2.48
Time taken for list_comprehension: 2.05
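A single run timed with `time.time()` can be noisy. As a hedged alternative, the standard library's `timeit` module repeats the measurement, and taking the minimum over the repeats gives a steadier comparison. The sketch below uses a smaller `n` than the cell above so that it runs quickly; the actual timings depend on the machine, so no expected output is shown.

```python
import timeit


def for_loop(n):
    result = []
    for i in range(n):
        result.append(i**2)
    return result


def list_comprehension(n):
    return [i**2 for i in range(n)]


n = 10**4

# Each measurement runs the function 20 times; we take 5 such measurements
loop_times = timeit.repeat(lambda: for_loop(n), number=20, repeat=5)
comp_times = timeit.repeat(lambda: list_comprehension(n), number=20, repeat=5)

print('Best time for_loop:', round(min(loop_times), 4))
print('Best time list_comprehension:', round(min(comp_times), 4))
```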


## Building Models

We'll recreate our DataFrame from the last demo.

In [19]:
# Generating list of n many salaries
from random import gauss
# We will use the numpy library
import numpy as np


list_of_names = ["Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", 
                      "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", 
                      "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", 
                      "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", 
                      "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", 
                      "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", 
                      "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", 
                      "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", 
                      "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", 
                      "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", 
                      "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", 
                      "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", 
                      "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", 
                      "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", 
                      "Landon", "David", "Christian", "Andrew", "Brayden", "John", 
                      "Lincoln"]

n = len(list_of_names)
mu = 50000 # mean
sigma = 20000 # standard deviation
salaries = []

for i in range(n):
    salaries += [int(gauss(mu, sigma))]
    

# Generating list of n many past_crime_severity values
items = ["no crime", "infraction", "misdemeanor", "felony"]
probs = [.50, .40, .08, .02]
past_crime_severity = np.random.choice(items, n, p = probs) # run `help(np.random.choice)` to read documentation


# Generating list of n many has_past_unpaid_loan values
has_past_unpaid_loan = np.random.binomial(n = 1, size = n, p = 0.2)

# Generating response feature y
paid_back_loan = np.random.binomial(n = 1, size = n, p = 0.9)


# Initializing Pandas DataFrame
df = pd.DataFrame({'Salary' : pd.Series(salaries, index = list_of_names),
                  'past_crime_severity' : pd.Series(past_crime_severity, index = list_of_names),
                  'has_past_unpaid_loan' : pd.Series(has_past_unpaid_loan, index = list_of_names),
                  'paid_back_loan' : pd.Series(data = paid_back_loan, index = list_of_names)
                  })

# Cast the has_past_unpaid_loan feature as bool
df['has_past_unpaid_loan'] = df['has_past_unpaid_loan'].astype('bool')

df.head()

Unnamed: 0,Salary,past_crime_severity,has_past_unpaid_loan,paid_back_loan
Sophia,61399,infraction,False,1
Emma,56945,infraction,False,1
Olivia,49087,no crime,False,1
Ava,52389,no crime,False,1
Mia,59413,misdemeanor,False,1


Note that our DataFrame does not consist of just `X`; it also contains `y`. Professor Kapelner likes to create a new object called `Xy` to denote that the data contains both the descriptive features and the response variable. It is a common Python convention to call the DataFrame just `df` if you do not have any other DataFrames in the current session, so we'll stick with that.
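If you do want `X` and `y` as separate objects, one way is to drop the response column for `X` and select it for `y`. This is a minimal sketch on a toy DataFrame with made-up values, not the DataFrame generated above.

```python
import pandas as pd

# Toy stand-in for the loan DataFrame (hypothetical values)
toy = pd.DataFrame({"Salary": [61399, 56945, 49087],
                    "has_past_unpaid_loan": [False, False, True],
                    "paid_back_loan": [1, 1, 0]},
                   index=["Sophia", "Emma", "Olivia"])

X = toy.drop(columns="paid_back_loan")   # descriptive features only
y = toy["paid_back_loan"]                # response variable only

print(X.shape, y.shape)
```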

### The Null Model

The `y` variable we wish to predict on is `paid_back_loan`. This variable is discrete, not continuous, so our null model $g_0$ will consider the mode of `y` rather than the mean.

In [20]:
# Defining the null model
g_0 = df['paid_back_loan'].mode()[0] 
g_0

1
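The null model is also the baseline any later model has to beat, so it is worth computing its error rate. A minimal sketch on a toy response vector follows; the values of `y_toy` are made up for illustration.

```python
import pandas as pd

# Toy binary response (hypothetical values): eight 1's and two 0's
y_toy = pd.Series([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])

# Null model: always predict the most common class
g_0_toy = y_toy.mode()[0]

# Its error rate is the share of observations that are not the mode
null_error_rate = (y_toy != g_0_toy).mean()

print(g_0_toy, null_error_rate)
```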

### The Threshold Model

Let's compute the threshold model and see what happens. Here's an inefficient but quite pedagogical way to do this:

In [21]:
n = len(df)
y_logical = df['paid_back_loan']
threshold_param = []
num_errors = []

for i in range(0, n):
    threshold = df['Salary'].iloc[i] # positional access, since the index holds names rather than integers
    threshold_param += [threshold]
    num_errors += [sum((df['Salary'] > threshold) != y_logical)]
    
num_errors_by_parameter = {"threshold_param": threshold_param, "num_errors": num_errors}
num_errors_by_parameter = pd.DataFrame(num_errors_by_parameter)

num_errors_by_parameter

Unnamed: 0,threshold_param,num_errors
0,61399,76
1,56945,68
2,49087,47
3,52389,55
4,59413,72
...,...,...
95,69350,85
96,45395,41
97,44155,35
98,48404,45


In [22]:
# Look at all of the threshold_params in order
num_errors_by_parameter.sort_values(by = 'num_errors')

Unnamed: 0,threshold_param,num_errors
23,14573,6
6,16781,7
5,17527,8
62,17632,9
11,21119,10
...,...,...
28,86628,91
55,89923,92
79,92340,93
48,96780,94


In [23]:
# Sort the df by `num_errors`
ordered_num_errors_by_parameter = num_errors_by_parameter.sort_values(by = 'num_errors')

In [24]:
# Identify the threshold_param with the least errors
best_row = ordered_num_errors_by_parameter.iloc[0]
best_row

threshold_param    14573
num_errors             6
Name: 23, dtype: int64

In [25]:
# get best threshold_param (select by label rather than by position)
x_star = best_row['threshold_param']
x_star

14573
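The same search can be done without an explicit loop. The sketch below uses NumPy broadcasting to try every observed salary as a threshold at once; the toy `salary` and `y` arrays are made up for illustration, and this is one possible vectorization rather than the method used in the demo.

```python
import numpy as np

# Toy data (hypothetical values)
salary = np.array([15000, 30000, 45000, 60000, 75000])
y = np.array([0, 1, 1, 1, 1])

# preds[j, i] is the prediction for observation i using threshold salary[j]
preds = salary[None, :] > salary[:, None]

# Count the disagreements with y for each candidate threshold (one per row)
num_errors = (preds != y.astype(bool)).sum(axis=1)

# The best threshold is the candidate with the fewest errors
best_threshold = salary[np.argmin(num_errors)]

print(num_errors, best_threshold)
```

This builds an n-by-n boolean matrix, so it trades memory for speed and is only sensible for modest n.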

Let's program `g`, the model that is shipped as the prediction function for future $x_*$.

In [26]:
# define predictive function g
def g(x):
    if x > x_star:
        return 1
    else:
        return 0

In [27]:
# Test 1
print(g(15000))

# Test 2
print(g(5000))

# Is this expected?

1
0


Great! We just built a Threshold model, now let's move on to the last model of the demo.

## The Perceptron Model

Let's import the Breast Cancer dataset for this.

In [28]:
# Importing dataset
from sklearn.datasets import load_breast_cancer

# Initializing dataset
BC = load_breast_cancer()

# Check out the keys in the dataset
BC.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [29]:
# Good documentation is important (print it so the line breaks render)
print(BC['DESCR'])



In [30]:
# Initialize Breast Cancer df
df = pd.DataFrame(data = BC.data, columns = BC.feature_names)

In [31]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [32]:
df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [33]:
# Define descriptive features
X = df.iloc[:, :9]
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809


In [34]:
# Define response feature
y = BC.target
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [35]:
import numpy as np

# Count the number of 0's and 1's
np.unique(y, return_counts = True)

(array([0, 1]), array([212, 357], dtype=int64))

First question. Let $\mathcal{H}$ be the set $\{0, 1\}$ meaning $g = 0$ or $g = 1$. What are the error rates then on $\mathbb{D}$?

In [36]:
# If g is always 1, all the 0's are errors (212 of the 569 observations)
212 / (212 + 357)

0.37258347978910367

In [37]:
# If g is always 0, all the 1's are errors (357 of the 569 observations)
357 / (212 + 357)

0.6274165202108963
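The two rates can also be computed directly from a label vector instead of typing in the counts. The sketch below rebuilds a vector with the same class counts as above (212 zeros and 357 ones); the ordering of the labels is irrelevant for the error rates.

```python
import numpy as np

# A label vector with the same class counts as the breast cancer target
y_counts = np.array([0] * 212 + [1] * 357)

err_always_0 = np.mean(y_counts != 0)   # g = 0: every 1 is an error, 357 / 569
err_always_1 = np.mean(y_counts != 1)   # g = 1: every 0 is an error, 212 / 569

print(round(err_always_0, 4), round(err_always_1, 4))
```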

If your $g$ can't beat the better of these two rates (about 37%), then your features $x_1, \ldots, x_p$ are terrible, and/or $\mathcal{H}$ was a terrible choice, and/or $\mathcal{A}$ can't pull its weight.

Okay... back to the "perceptron learning algorithm".

Let's do so for one dimension: just the first feature, `mean radius`, in the breast cancer data. You will do an example with more features in the lab.
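Written out, the update applied to each observation in the loop of the next cell is the following. This is read off from the code in this demo, with $\mathbf{x}_i = [1, x_{i,1}]$ (a 1 for the intercept followed by the feature value) and $\mathbf{w} = [w_0, w_1]$ initialized to zeros:

$$\hat{y}_i = \mathbb{1}\left[\mathbf{w} \cdot \mathbf{x}_i > 0\right], \qquad \mathbf{w} \leftarrow \mathbf{w} + (y_i - \hat{y}_i)\,\mathbf{x}_i$$

When the prediction is correct, $y_i - \hat{y}_i = 0$ and $\mathbf{w}$ is unchanged. When it is wrong, $\mathbf{x}_i$ is added to $\mathbf{w}$ (if $y_i = 1$) or subtracted from it (if $y_i = 0$).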

In [38]:
MAX_ITER = 100
w_vec = [0, 0]

# Creating a vector of 1's for the intercept term
n = len(X)
one_vec = [1] * n

# "c-binding": put the 1's and the feature side by side as columns
data = {"1": one_vec, "mean radius": X['mean radius']}
X1 = pd.DataFrame(data)

for _ in range(MAX_ITER):               # `_` avoids shadowing the built-in iter()
    for i in range(len(X1)):
        x_i = X1.iloc[i].to_numpy()     # [1, mean radius] as a positional array
        sum_x_and_w_vec = x_i[0] * w_vec[0] + x_i[1] * w_vec[1]
        yhat_i = 1 if sum_x_and_w_vec > 0 else 0
        y_i = y[i]

        # w_vec only changes when the prediction is wrong (y_i != yhat_i)
        w_vec[0] = w_vec[0] + (y_i - yhat_i) * x_i[0]
        w_vec[1] = w_vec[1] + (y_i - yhat_i) * x_i[1]
        
w_vec

[722.0, -53.990000000000094]
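The same algorithm can be wrapped in a function that works for any number of features and stops early once a full pass makes no mistakes. This is a sketch under assumptions: the function name `perceptron_fit` and the toy data are made up for illustration, and the toy data is linearly separable, which the breast cancer data with one feature is not.

```python
import numpy as np


def perceptron_fit(X, y, max_iter=100):
    """Perceptron learning algorithm. X must already contain a column of 1's."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        n_updates = 0
        for x_i, y_i in zip(X, y):
            yhat_i = 1 if x_i @ w > 0 else 0
            if yhat_i != y_i:
                w += (y_i - yhat_i) * x_i   # same update as in the cell above
                n_updates += 1
        if n_updates == 0:                  # a full pass with no mistakes
            break
    return w


# Toy, linearly separable data (hypothetical): the label is 1 when x < 3
x = np.array([1.0, 2.0, 4.0, 5.0])
X_toy = np.column_stack([np.ones(len(x)), x])   # prepend the column of 1's
y_toy = np.array([1, 1, 0, 0])

w = perceptron_fit(X_toy, y_toy)
yhat_toy = (X_toy @ w > 0).astype(int)

print(w, yhat_toy)
```

On data that is not linearly separable the early stop never triggers, and the function simply returns the weights after `max_iter` passes.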

In [39]:
# What is our error rate?

# The @ operator performs matrix multiplication (supported by NumPy and Pandas objects)
matrix_mult = X1 @ w_vec

yhat = [1 if x > 0 else 0 for x in matrix_mult] # list comprehension, already 0/1 ints

y = [int(x) for x in y] # cast the NumPy ints to Python ints

total = 0
for i in range(len(y)):
    if yhat[i] != y[i]:
        total += 1
        
error_rate = round(total/len(y), 4)

print(f"Error rate: {error_rate}")

Error rate: 0.1863
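The counting loop can be replaced by a single vectorized comparison: the mean of a boolean array is the proportion of `True` values. A minimal sketch on toy vectors follows; the values are made up for illustration.

```python
import numpy as np

# Toy labels and predictions (hypothetical values)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Element-wise comparison gives True where the prediction is wrong
toy_error_rate = np.mean(y_true != y_pred)

print(f"Error rate: {toy_error_rate}")
```

An error rate computed this way should be compared against the better of the two constant-model rates computed earlier (about 0.37).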
