# **13. Preprocessing for Machine Learning in Python**
[Link](https://github.com/goodboychan/goodboychan.github.io/blob/main/_notebooks/2020-07-09-02-Introduction-to-Data-Preprocessing.ipynb)

## Chapter 1 - Introduction to Data Preprocessing
> In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data. This is the Summary of lecture "Preprocessing for Machine Learning in Python", via datacamp.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Datacamp, Machine_Learning]
- image: 

In [98]:
import pandas as pd

### What is data preprocessing?
- Data Preprocessing
    - Beyond cleaning and exploratory data analysis
    - Prepping data for modeling
    - Modeling in python requires numerical input

### Missing data - columns
We have a dataset comprised of volunteer information from New York City. The dataset has a number of features, but we want to get rid of features that have at least 3 missing values.

In [99]:
volunteer = pd.read_csv('volunteer_opportunities.csv')
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


In [100]:
volunteer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   opportunity_id      665 non-null    int64  
 1   content_id          665 non-null    int64  
 2   vol_requests        665 non-null    int64  
 3   event_time          665 non-null    int64  
 4   title               665 non-null    object 
 5   hits                665 non-null    int64  
 6   summary             665 non-null    object 
 7   is_priority         62 non-null     object 
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object 
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object 
 13  org_content_id      665 non-null    int64  
 14  addresses_count     665 non-null    int64  
 15  locality            595 non-null    object 
 16  region  

In [101]:
volunteer.dropna(axis=1, thresh=3).shape

(665, 24)

In [102]:
volunteer.shape

(665, 35)

### Missing data - rows
Taking a look at the `volunteer` dataset again, we want to drop rows where the `category_desc` column values are missing. We're going to do this using boolean indexing, by checking to see if we have any null values, and then filtering the dataset so that we only have rows with those values.

In [103]:
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

48
(617, 35)


### Working with data types
- dtypes in pandas
    - object: string/mixed types
    - int64: integer
    - float64: float
    - datetime64 (or timedelta): datetime

### Exploring data types
Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.

In [104]:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

### Converting a column type
If you take a look at the `volunteer` dataset types, you'll see that the column `hits` is type `object`. But, if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type `int`.

In [105]:
# Print the head of the hits column
print(volunteer['hits'].head())

# Convert the hits column to type int
volunteer['hits'] = volunteer['hits'].astype(int)

# Look at the dtypes of the dataset
print(volunteer.dtypes)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64


### Class distribution
- Stratified sampling
    - A way of sampling that takes into account the distribution of classes or features in your dataset

### Class imbalance
In the `volunteer` dataset, we're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.

In [106]:
volunteer['category_desc'].value_counts()

category_desc
Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: count, dtype: int64

### Stratified sampling
We know that the distribution of variables in the `category_desc` column in the volunteer dataset is uneven. If we wanted to train a model to try to predict `category_desc`, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

In [107]:
from sklearn.model_selection import train_test_split

# Create a data with all columns except category_desc
volunteer_X = volunteer.dropna(subset=['category_desc'], axis=0)

# Create a category_desc labels dataset
volunteer_y = volunteer.dropna(subset=['category_desc'], axis=0)[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())

category_desc
Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: count, dtype: int64


> Warning: stratify sampling on `train_test_split` cannot handle the `NaN` data, so you need to drop NaN values before sampling

## Chapter 2 - Standardizing Data
> This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance. This is the Summary of lecture "Preprocessing for Machine Learning in Python", via datacamp.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Datacamp, Machine_Learning]
- image: 

In [108]:
import pandas as pd
import numpy as np

### Standardizing Data
- Standardization
    - Preprocessing method used to transform continuous data to make it look normally distributed
    - Scikit-learn models assume normally distributed data
        - Log normalization
        - feature Scaling
- When to standardize: models
    - Model in linear space
    - Dataset features have high variance
    - Dataset features are continuous and on different scales
    - Linearity assumptions

### Modeling without normalizing
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the wine dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on.

In [109]:
wine = pd.read_csv('wine_types.csv')
wine.head()

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [110]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = wine[['Proline', 'Total phenols', 'Hue', 'Nonflavanoid phenols']]
y = wine['Type'] 

In [111]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state=42)

knn = KNeighborsClassifier()

# Fit the knn model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.6888888888888889


### Log normalization method
- Applies log transformation
- Natural log using the constant $e$ (2.718)
- Captures relative changes, the magnitude of change, and keeps everything in the positive space

### Checking the variance
Check the variance of the columns in the `wine` dataset.

In [112]:
wine.describe()

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


### Log normalization in Python
Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it.

In [113]:
# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542436
0.17231366191842012


### Scaling data for feature comparison
- Features on different scales
- Model with linear characteristics
- Center features around 0 and transform to unit variance(1)
- Transforms to approximately normal distribution

### Scaling data - investigating columns
We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model. Using `describe()` to return descriptive statistics about this dataset, which of the following statements are true about the scale of data in these columns?



In [114]:
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()

Unnamed: 0,Ash,Alcalinity of ash,Magnesium
count,178.0,178.0,178.0
mean,2.366517,19.494944,99.741573
std,0.274344,3.339564,14.282484
min,1.36,10.6,70.0
25%,2.21,17.2,88.0
50%,2.36,19.5,98.0
75%,2.5575,21.5,107.0
max,3.23,30.0,162.0


### Scaling data - standardizing columns
Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the `wine` dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.



In [115]:
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

print(wine_subset.iloc[:3])

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

print(wine_subset_scaled[:3])

    Ash  Alcalinity of ash  Magnesium
0  2.43               15.6        127
1  2.14               11.2        100
2  2.67               18.6        101
[[ 0.23205254 -1.16959318  1.91390522]
 [-0.82799632 -2.49084714  0.01814502]
 [ 1.10933436 -0.2687382   0.08835836]]


### Standardized data and modeling

### KNN on non-scaled data
Let's first take a look at the accuracy of a K-nearest neighbors model on the `wine` dataset without standardizing the data. The `knn` model as well as the `X` and `y` data and labels sets have been created already. Most of this process of creating models in scikit-learn should look familiar to you.



In [116]:
wine = pd.read_csv('wine_types.csv')

X = wine.drop('Type', axis=1)
y = wine['Type'] 

knn = KNeighborsClassifier()

In [117]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.7555555555555555


### KNN on scaled data
The accuracy score on the unscaled wine dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data. 



In [118]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Instantiate a StandardScaler
scaler = StandardScaler()

# Scale the training and test features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create a K-nearest neighbors model instance
knn = KNeighborsClassifier()

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train_scaled, y_train)

# Score the model on the test data
print(knn.score(X_test_scaled, y_test))


0.9333333333333333


## **Chapter 3 - Feature Engineering**

[Link ref:](https://gist.github.com/vidit0210/c2f74323c8d2096729000f98ffbee4ac)

Encoding categorical variables - binary


In [119]:
from sklearn.preprocessing import LabelEncoder

hiking = pd.read_json('hiking.json')

# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())


  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0


Encoding categorical variables - one-hot


In [120]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
print(category_enc.head())


   Education  Emergency Preparedness  Environment  Health  \
0      False                   False        False   False   
1      False                   False        False   False   
2      False                   False        False   False   
3      False                   False        False   False   
4      False                   False         True   False   

   Helping Neighbors in Need  Strengthening Communities  
0                      False                      False  
1                      False                       True  
2                      False                       True  
3                      False                       True  
4                      False                      False  


Aggregating numerical features


In [121]:
import pandas as pd  

running_times_5k = pd.read_csv('running_times_5k.csv')
print(running_times_5k.head(2))

   name  run1  run2  run3  run4  run5
0   Sue  20.1  18.5  19.6  20.3  18.3
1  Mark  16.5  17.1  16.9  17.6  17.3


In [122]:
# Use .loc to create a mean column
running_times_5k["mean"] = running_times_5k[['run1', 'run2', 'run3', 'run4', 'run5']].mean(axis=1)

# Take a look at the results
print(running_times_5k.head(2))

   name  run1  run2  run3  run4  run5   mean
0   Sue  20.1  18.5  19.6  20.3  18.3  19.36
1  Mark  16.5  17.1  16.9  17.6  17.3  17.08


Extracting string patterns


In [123]:
import re
import pandas as pd

# Assuming "hiking.json" is a valid JSON file
hiking = pd.read_json("hiking.json")

# Write a pattern to extract numbers and decimals
def return_mileage(length):
    pattern = re.compile(r"\d+\.\d+")
    
    # Search the text for matches
    mile = re.search(pattern, str(length))  # Ensure 'length' is converted to a string
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))

# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(lambda row: return_mileage(row))
print(hiking[["Length", "Length_num"]].head())


       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50


Vectorizing text


In [124]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the CSV file into a DataFrame
volunteer = pd.read_csv('volunteer_opportunities.csv')

# Extract the 'title' column, which presumably contains textual information
title_text = volunteer["title"]

# Create an instance of the TfidfVectorizer
tfidf_vec = TfidfVectorizer()

# Transform the textual data into TF-IDF vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

# Print the resulting TF-IDF matrix
print(text_tfidf)


  (0, 375)	0.3163915503784279
  (0, 855)	0.38192461589865456
  (0, 493)	0.3405778550191958
  (0, 822)	0.38192461589865456
  (0, 959)	0.38192461589865456
  (0, 1061)	0.25544926998167106
  (0, 869)	0.38192461589865456
  (0, 404)	0.15529778130809513
  (0, 690)	0.24072387702158726
  (0, 1086)	0.2304728774077965
  (1, 297)	0.6824588570832413
  (1, 1095)	0.7309240099960025
  (2, 868)	0.3903159429478348
  (2, 587)	0.4150336487706769
  (2, 98)	0.2225785986760572
  (2, 930)	0.35917531643639333
  (2, 515)	0.3903159429478348
  (2, 43)	0.4150336487706769
  (2, 1063)	0.4150336487706769
  (3, 710)	0.2056446046212042
  (3, 523)	0.18207403700074992
  (3, 255)	0.295432987618702
  (3, 31)	0.31414199966461376
  (3, 739)	0.31414199966461376
  (3, 1012)	0.16847136678616276
  :	:
  (660, 711)	0.5093842109879203
  (660, 402)	0.45423880718111764
  (660, 808)	0.39909340337431487
  (660, 1084)	0.19657602494096119
  (660, 404)	0.20712526636615075
  (661, 979)	0.4701913038392676
  (661, 548)	0.4701913038392676
  

Text classification using tf/idf vectors


In [125]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Read the CSV file into a DataFrame
volunteer = pd.read_csv('volunteer_opportunities.csv')

# Drop rows with missing values in the 'category_desc' column
volunteer = volunteer.dropna(subset=['category_desc'])

# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]

# Create the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Transform the text data into TF-IDF vectors
text_tfidf = tfidf_vectorizer.fit_transform(volunteer["title"])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=42)

# Create a Multinomial Naive Bayes model
nb = MultinomialNB()

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))


0.5548387096774193


## **Chapter 4 - Selecting Features for Modeling**

Selecting relevant features


In [126]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of volunteer_subset
print(volunteer_subset.head())

   opportunity_id  content_id  event_time  \
1            5008       37036           0   
2            5016       37143           0   
3            5022       37237           0   
4            5055       37425           0   
5            5056       37426           0   

                                               title  hits  \
1                                       Web designer    22   
2      Urban Adventures - Ice Skating at Lasker Rink    62   
3  Fight global hunger and support women farmers ...    14   
4                                      Stop 'N' Swap    31   
5                               Queens Stop 'N' Swap   135   

                                             summary is_priority  category_id  \
1             Build a website for an Afghan business         NaN          1.0   
2  Please join us and the students from Mott Hall...         NaN          1.0   
3  The Oxfam Action Corps is a group of dedicated...         NaN          1.0   
4  Stop 'N' Swap reduces NYC's w

Checking for correlated features


In [127]:
wine = pd.read_csv('wine_types.csv')

# Print out the column correlations of the wine dataset
print(wine.corr())

# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)

                                  Type   Alcohol  Malic acid       Ash  \
Type                          1.000000 -0.328222    0.437776 -0.049643   
Alcohol                      -0.328222  1.000000    0.094397  0.211545   
Malic acid                    0.437776  0.094397    1.000000  0.164045   
Ash                          -0.049643  0.211545    0.164045  1.000000   
Alcalinity of ash             0.517859 -0.310235    0.288500  0.443367   
Magnesium                    -0.209179  0.270798   -0.054575  0.286587   
Total phenols                -0.719163  0.289101   -0.335167  0.128980   
Flavanoids                   -0.847498  0.236815   -0.411007  0.115077   
Nonflavanoid phenols          0.489109 -0.155929    0.292977  0.186230   
Proanthocyanins              -0.499130  0.136698   -0.220746  0.009652   
Color intensity               0.265668  0.546364    0.248985  0.258887   
Hue                          -0.617369 -0.071747   -0.561296 -0.074667   
OD280/OD315 of diluted wines -0.788230

Exploring text vectors, part 1


In [128]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example data
data = [
    'web designer',
    'urban design',
    'graphic designer',
    'software developer',
    'urban planning',
    'web developer',
    'architect',
    'interior designer',
    'urban gardening',
    'web development'
]

# Example vocabulary
vocab = {
    'web': 0,
    'designer': 1,
    'urban': 2,
    'graphic': 3,
    'software': 4,
    'developer': 5,
    'planning': 6,
    'architect': 7,
    'interior': 8,
    'gardening': 9
}

# Create the TfidfVectorizer with the provided vocabulary
tfidf_vec = TfidfVectorizer(vocabulary=vocab)

# Transform the text data into TF-IDF vectors
text_tfidf = tfidf_vec.fit_transform(data)

# Convert the sparse matrix to a dense matrix for easier inspection
dense_matrix = text_tfidf.toarray()

# Display the dense matrix
df = pd.DataFrame(dense_matrix, columns=tfidf_vec.get_feature_names_out())
print(df)


        web  designer     urban   graphic  software  developer  planning  \
0  0.707107  0.707107  0.000000  0.000000  0.000000   0.000000  0.000000   
1  0.000000  0.000000  1.000000  0.000000  0.000000   0.000000  0.000000   
2  0.000000  0.596775  0.000000  0.802409  0.000000   0.000000  0.000000   
3  0.000000  0.000000  0.000000  0.000000  0.761905   0.647689  0.000000   
4  0.000000  0.000000  0.596775  0.000000  0.000000   0.000000  0.802409   
5  0.658454  0.000000  0.000000  0.000000  0.000000   0.752621  0.000000   
6  0.000000  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000   
7  0.000000  0.596775  0.000000  0.000000  0.000000   0.000000  0.000000   
8  0.000000  0.000000  0.596775  0.000000  0.000000   0.000000  0.000000   
9  1.000000  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000   

   architect  interior  gardening  
0        0.0  0.000000   0.000000  
1        0.0  0.000000   0.000000  
2        0.0  0.000000   0.000000  
3        0.0  0.000

In [129]:
# Create the TfidfVectorizer with the provided vocabulary
tfidf_vec = TfidfVectorizer(vocabulary=vocab)

# Transform the text data into TF-IDF vectors
text_tfidf = tfidf_vec.fit_transform(data)

# Adjust the original_vocab dictionary to map words to indices
original_vocab = {v: i for i, v in enumerate(tfidf_vec.get_feature_names_out())}

# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))

    # Print the indices for debugging
    print("Indices in the sparse matrix:", vector[vector_index].indices)
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({original_vocab[i]: zipped[i] for i in vector[vector_index].indices if i in original_vocab})
    
    # Print the zipped_series for debugging
    print("Zipped series:", zipped_series)
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return zipped_index

# Print out the weighted words
print(return_weights(vocab, original_vocab, text_tfidf, vector_index=8, top_n=3))


Indices in the sparse matrix: [9 2]
Zipped series: Series([], dtype: object)
RangeIndex(start=0, stop=0, step=1)


Exploring text vectors, part 2


In [130]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(filtered_words)]

Indices in the sparse matrix: [1 0]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [2]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [3 1]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [5 4]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [6 2]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [5 0]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [7]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [8 1]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [9 2]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [0]
Zipped series: Series([], dtype: object)


Training Naive Bayes with feature selection


In [None]:
# Split the dataset according to the class distribution of category_desc
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

Using PCA

In [144]:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import pandas as pd

# Assuming 'wine' is your DataFrame
# Instantiate a PCA object
pca = PCA()

# Define the features and labels from the wine dataset
X = wine.drop("Type", axis=1)
y = wine["Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply PCA to the wine dataset X vector
pca_X_train = pca.fit_transform(X_train)
pca_X_test = pca.transform(X_test)  # Use transform, not fit_transform, on the test set

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)


[9.97802349e-01 2.02071713e-03 9.82348559e-05 5.53994004e-05
 1.10395648e-05 5.87233448e-06 3.13858204e-06 1.54420449e-06
 1.02927386e-06 3.90521513e-07 1.95535151e-07 8.99659634e-08]


Training a model with PCA


In [146]:
# Fit knn to the training data
knn.fit(pca_X_train, y_train)

# Score knn on the test data and print it out
print(knn.score(pca_X_test, y_test))

0.7777777777777778


## **Chapter 5 - Putting It All Together**

Checking column types


In [150]:
ufo = pd.read_csv('ufo_sightings_large.csv')
print(ufo.head(4))

               date         city state country      type    seconds  \
0   11/3/2011 19:21    woodville    wi      us   unknown  1209600.0   
1   10/3/2004 19:05    cleveland    oh      us    circle       30.0   
2   9/25/2009 21:00  coon rapids    mn      us     cigar        0.0   
3  11/21/2002 05:45     clemmons    nc      us  triangle      300.0   

    length_of_time                                               desc  \
0          2 weeks  Red blinking objects similar to airplanes or s...   
1           30sec.               Many fighter jets flying towards UFO   
2              NaN  Green&#44 red&#44 and blue pulses of light tha...   
3  about 5 minutes  It was a large&#44 triangular shaped flying ob...   

     recorded         lat       long  
0  12/12/2011  44.9530556 -92.291111  
1  10/27/2004  41.4994444 -81.695556  
2  12/12/2009  45.1200000 -93.287500  
3  12/23/2002  36.0213889 -80.382222  


In [152]:
# Print the DataFrame info
print(ufo.info())

# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype(float)

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
print(ufo.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            4935 non-null   datetime64[ns]
 1   city            4926 non-null   object        
 2   state           4516 non-null   object        
 3   country         4255 non-null   object        
 4   type            4776 non-null   object        
 5   seconds         4935 non-null   float64       
 6   length_of_time  4792 non-null   object        
 7   desc            4932 non-null   object        
 8   recorded        4935 non-null   object        
 9   lat             4935 non-null   object        
 10  long            4935 non-null   float64       
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 424.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Co

Dropping missing data


In [155]:
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[["length_of_time", "state", "type"]].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo["length_of_time"].notnull() & 
          ufo["state"].notnull() & 
          ufo["type"].notnull()]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    0
state             0
type              0
dtype: int64
(1866, 12)


Extracting numbers from strings


In [157]:
def return_minutes(time_string):

    pattern = re.compile(r'\d+')
    # Search for numbers in time_string
    num = re.search(pattern, time_string)
    if num is not None:
        return int(num.group(0))
        
# Apply the extraction to the length_of_time column
ufo['minutes'] = ufo['length_of_time'].apply(lambda row: return_minutes(row))

# Take a look at the head of both of the columns
print(ufo[['length_of_time', 'minutes']].head(10))

    length_of_time  minutes
0  about 5 minutes      5.0
1       10 minutes     10.0
2        2 minutes      2.0
3        2 minutes      2.0
4        5 minutes      5.0
5       10 minutes     10.0
6        5 minutes      5.0
7        5 minutes      5.0
8        5 minutes      5.0
9          1minute      1.0


Identifying features for standardization


In [None]:
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log normalize the seconds column
ufo['seconds_log'] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo['seconds_log'].var())

Encoding categorical variables


In [158]:
# Use Pandas to encode us values as 1 and others as 0
ufo['country_enc'] = ufo['country'].apply(lambda x: 1 if x == 'us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

21


Features from dates


In [160]:
# Look at the first 5 rows of the date column
print(ufo['date'].head(5))

# Extract the month from the date column
ufo['month'] = ufo['date'].apply(lambda date: date.month)

# Extract the year from the date column
ufo['year'] = ufo['date'].apply(lambda date: date.year)

# Take a look at the head of all three columns
print(ufo[['date', 'month', 'year']].head())

0   2002-11-21 05:45:00
1   2012-06-16 23:00:00
2   2013-06-09 00:00:00
3   2013-04-26 23:27:00
4   2013-09-13 20:30:00
Name: date, dtype: datetime64[ns]
                 date  month  year
0 2002-11-21 05:45:00     11  2002
1 2012-06-16 23:00:00      6  2012
2 2013-06-09 00:00:00      6  2013
3 2013-04-26 23:27:00      4  2013
4 2013-09-13 20:30:00      9  2013


Text vectorization


In [161]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Take a look at the head of the desc field
print(ufo['desc'].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo['desc'])

# Look at the number of columns this creates
print(desc_tfidf.shape)

0    It was a large&#44 triangular shaped flying ob...
1    Dancing lights that would fly around and then ...
2    Brilliant orange light or chinese lantern at o...
3    Bright red light moving north to north west fr...
4    North-east moving south-west. First 7 or so li...
Name: desc, dtype: object
(1866, 3422)


Selecting the ideal dataset


In [162]:
# Make a list of features to drop
to_drop = ['city', 'country', 'date', 'desc', 'lat','length_of_time', 'seconds', 'minutes', 'long', 'state', 'recorded']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, top_n=4)


Indices in the sparse matrix: [2134 1320 2657 3123  147 1744 3275 1664]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [1787 2173 1645 1923 3007  340  395 1319 3379 3002 1794  910]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [  92  251 1690 1942 2176 2130  273 3298 3050 1102 2021 1363   15 3001
 1774  412 1738  718 2184 2188  604 1787  147]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [ 766  502 1003 3041 1539 3003 1360 2097 2472  596 3298 3050 2021 1787
 1664]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [2094 2330 2107 2943 3296 3015 2899 1063 1462  873 2766 1276 2793 2097
 3298 1102 2021 1738  718 2184 1787 3007  147]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [2751 2157  449  596 2021 1787 1744]
Zipped series: Series([], dtype: object)
Indices in the sparse matrix: [1025 3057 1915 2435 3315  351  637 1926  738 2031 2872 3003 3050 2021
 3007  340 1794 21

Modeling the UFO dataset, part 1


In [163]:
X = ufo_dropped.drop(['type', 'country_enc'], axis=1)
y = ufo_dropped['country_enc']

In [164]:
print(X.columns)


Index(['changing', 'chevron', 'cigar', 'circle', 'cone', 'cross', 'cylinder',
       'diamond', 'disk', 'egg', 'fireball', 'flash', 'formation', 'light',
       'other', 'oval', 'rectangle', 'sphere', 'teardrop', 'triangle',
       'unknown', 'month', 'year'],
      dtype='object')


In [165]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

0.8736616702355461


Modeling the UFO dataset, part 2

Finally, let's build a model using the text vector we created, desc_tfidf, using the filtered_words list to create a filtered text vector. Let's see if we can predict the type of the sighting based on the text. We'll use a Naive Bayes model for this.



In [175]:
y = ufo_dropped['type']


In [176]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Assuming desc_tfidf and y are properly defined

nb = GaussianNB()

# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Print information for debugging
print("Shape of desc_tfidf:", desc_tfidf.shape)
print("Shape of filtered_text:", filtered_text.shape)
print("Filtered words:", filtered_words)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
print(nb.score(test_X, test_y))


Shape of desc_tfidf: (1866, 3422)
Shape of filtered_text: (1866, 0)
Filtered words: set()


ValueError: Found array with 0 feature(s) (shape=(1399, 0)) while a minimum of 1 is required by GaussianNB.