# Course Description

This course covers the basics of how and when to perform data preprocessing, the essential step in any machine learning project where you get your data ready for modeling. Preprocessing comes into play after you import and clean your data and before you fit your machine learning model. You'll learn how to standardize your data so that it's in the right form for your model, create new features to best leverage the information in your dataset, and select the best features to improve your model fit. Finally, you'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.

In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

## Introduction to preprocessing

#### Exploring missing data
You've been given a dataset of volunteer information from New York City, stored in the volunteer DataFrame. Explore the dataset using the methods and attributes pandas has to offer to answer the following question.

How many missing values are in the locality column?

In [1]:
import pandas as pd
volunteer = pd.read_csv('volunteer_opportunities.TXT')
volunteer

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
660,5640,50193,3,0,Volunteer for NYLAG's Food Stamps Project,197,"Volunteers needed to file for fair hearings, d...",,2.0,Helping Neighbors in Need,...,November 15 2012,approved,,,,,,,,
661,5218,38711,10,0,Iridescent Science Studio Open House Volunteers,113,Come out to the South Bronx to help us hold ou...,,1.0,Strengthening Communities,...,April 13 2011,approved,,,,,,,,
662,5541,47820,1,0,French Translator,145,Volunteer needed to translate written material...,,2.0,Helping Neighbors in Need,...,September 01 2011,approved,,,,,,,,
663,5398,40722,2,0,Marketing & Advertising Volunteer,330,World Cares Center is looking for individuals ...,,1.0,Strengthening Communities,...,May 31 2012,approved,,,,,,,,


In [2]:
volunteer.locality.isna().sum()

70

a.665<br>
b.595<br>
<strong>c.70</strong><br>
d.35

Great work! Exploring your data is a crucial first step before preprocessing. Time to start removing missing data!
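The same check can be run on every column at once. The sketch below is a minimal illustration on a made-up toy DataFrame (its values are assumptions for this example, not rows from the volunteer dataset); it counts missing values per column and then for a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the volunteer DataFrame
toy = pd.DataFrame({
    "locality": ["Manhattan", np.nan, "Brooklyn", np.nan],
    "hits": [737, 22, 62, 14],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count for a single column, as in the exercise above
locality_missing = int(toy["locality"].isna().sum())
print(locality_missing)
```

Sorting `missing_per_column` in descending order is a quick way to spot the columns that are mostly empty.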

#### Dropping missing data
Now that you've explored the volunteer dataset and understand its structure and contents, it's time to begin dropping missing values.

In this exercise, you'll drop both columns and rows to create a subset of the volunteer dataset.

In [3]:
# first look at the shape of the data
print("First shape:", volunteer.shape)

# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(['Latitude','Longitude'], axis = 1)

# Drop rows with missing category_desc values from volunteer_cols
volunteer_subset = volunteer_cols.dropna(subset = ['category_desc'])

# Print out the shape of the subset
print("Second shape:", volunteer_subset.shape)

First shape: (665, 35)
Second shape: (617, 33)


Nice work! Remember that you can use Boolean indexing to effectively subset DataFrames.
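To make the link between dropna and Boolean indexing concrete, here is a small sketch on a hypothetical toy DataFrame (the column values are invented for illustration). It selects the same rows twice, once with dropna(subset=...) and once with a Boolean mask built from notna().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names echo the volunteer dataset
toy = pd.DataFrame({
    "category_desc": ["Health", np.nan, "Education", np.nan, "Environment"],
    "Latitude": [np.nan] * 5,
    "hits": [1, 2, 3, 4, 5],
})

# Drop a column, then drop rows where category_desc is missing
via_dropna = toy.drop(columns=["Latitude"]).dropna(subset=["category_desc"])

# Select the same rows and columns with a Boolean mask
mask = toy["category_desc"].notna()
via_mask = toy.loc[mask, ["category_desc", "hits"]]

print(via_dropna.shape)
print(via_dropna.equals(via_mask))
```

Both approaches keep the original row labels, so the result's index has gaps where rows were removed.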

## Working with data types

#### Exploring data types
Taking another look at the dataset of volunteer information from New York City, you want to know what types you'll be working with as you start to do more preprocessing.

Which data types are present in the volunteer dataset?

In [4]:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

a.Floats and integers only<br>
b.Integers only<br>
<strong>c.Floats, integers, and objects</strong><br>
d.Floats only

Correct! All three of these types are present in the DataFrame.
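When a DataFrame has many columns, summarizing the dtypes can be easier than reading the full listing. The following sketch uses a hypothetical three-column DataFrame (values invented for this example) to count columns per dtype and to pick out the numeric columns with select_dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with an integer, a float, and a text column
toy = pd.DataFrame({
    "vol_requests": [50, 2, 20],
    "category_id": [np.nan, 1.0, 1.0],
    "title": ["Web designer", "Stop 'N' Swap", "French Translator"],
})

# How many columns there are of each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Names of the numeric columns (integers and floats)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Names of everything else, here the text column
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(other_cols)
```

Note that a column of whole numbers with missing values is stored as float64, which is why category_id is a float here.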

#### Converting a column type
If you take a look at the volunteer dataset types, you'll see that the column hits is type object. But, if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type int.

In [5]:
# Print the head of the hits column
print(volunteer["hits"].head(), "\n")

# Convert the hits column to type int
volunteer["hits"] = volunteer.hits.astype('int64') #### it was int32

# Look at the dtypes of the dataset
print(volunteer.dtypes)




0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64 

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float6

Nice work! You can use astype to convert between a variety of types.
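As a hedged illustration of the conversion, the sketch below starts from a hypothetical column of numbers stored as strings and converts it with astype. It also shows pd.to_numeric with errors="coerce" as one possible fallback when some values cannot be parsed; the sample values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical column of integers stored as text
hits = pd.Series(["737", "22", "62"], name="hits")

# astype converts every value and raises if any of them cannot be parsed
hits_int = hits.astype("int64")
print(hits_int.dtype)

# A messier column: "n/a" is not a number
messy = pd.Series(["14", "n/a", "31"])

# errors="coerce" replaces unparseable values with NaN instead of raising
messy_numeric = pd.to_numeric(messy, errors="coerce")
print(messy_numeric)
```

Because NaN is a float, the coerced column comes back as float64 rather than an integer type.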

## Training and test sets

#### Class imbalance
In the volunteer dataset, you're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, you need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur fewer than 50 times in the volunteer dataset?

In [6]:
volunteer.category_desc.value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

#### Possible Answers

a.Emergency Preparedness<br>
b.Health<br>
c.Environment<br>
<strong>d.Environment and Emergency Preparedness</strong>

Correct! Both Emergency Preparedness and Environment occur fewer than 50 times.
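The same answer can be read off programmatically. This sketch rebuilds the counts printed by value_counts() above as a Series, filters for classes with fewer than 50 rows, and computes each class's share of the labeled rows. The variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "Strengthening Communities": 307,
    "Helping Neighbors in Need": 119,
    "Education": 92,
    "Health": 52,
    "Environment": 32,
    "Emergency Preparedness": 15,
})

# Boolean indexing keeps only the classes with fewer than 50 rows
rare = counts[counts < 50]
print(rare.index.tolist())

# Share of each class among the labeled rows
shares = (counts / counts.sum()).round(3)
print(shares)
```

On the real DataFrame, volunteer.category_desc.value_counts(normalize=True) gives the shares directly.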

#### Stratified sampling
You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you want to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

In [7]:
from sklearn.model_selection import train_test_split

# Drop rows with a missing label before splitting: stratify cannot handle NaN in y,
# and imputing the target with the most frequent class would distort the class balance
volunteer_labeled = volunteer.dropna(subset=['category_desc'])

# Create a DataFrame with all columns except category_desc
X = volunteer_labeled.drop('category_desc', axis=1)

# Create a category_desc labels dataset (double brackets keep it a DataFrame)
y = volunteer_labeled[['category_desc']]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Print the category_desc counts from y_train
print(y_train['category_desc'].value_counts())

# Compare class proportions in the training set with those in the labeled dataset
print(y_train['category_desc'].value_counts(normalize=True))
print(y['category_desc'].value_counts(normalize=True))

Great job! You'll use train_test_split() frequently while building models, so it's useful to be familiar with the function.
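To see what stratification guarantees, the sketch below runs train_test_split on a hypothetical toy dataset with an 80/20 class imbalance (the data and the test_size value are assumptions chosen so the arithmetic is exact). With stratify=y, both splits keep the same 80/20 ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class "a", 20 rows of class "b"
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 80 + ["b"] * 20,
})

X = toy[["feature"]]
y = toy["label"]

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
print(train_counts)
print(test_counts)
```

Without stratify, the split is purely random, so a rare class can end up over- or under-represented in the test set.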