# Exploring missing data

You've been given a dataset of volunteer information from New York City, stored in the `volunteer` DataFrame. Explore the dataset using pandas methods and attributes to answer the following question.

How many missing values are in the `locality` column?

In [3]:
import pandas as pd
volunteer = pd.read_csv("dataset/volunteer_opportunities.csv")
# volunteer.head()
volunteer.columns

Index(['opportunity_id', 'content_id', 'vol_requests', 'event_time', 'title',
       'hits', 'summary', 'is_priority', 'category_id', 'category_desc',
       'amsl', 'amsl_unit', 'org_title', 'org_content_id', 'addresses_count',
       'locality', 'region', 'postalcode', 'primary_loc', 'display_url',
       'recurrence_type', 'hours', 'created_date', 'last_modified_date',
       'start_date_date', 'end_date_date', 'status', 'Latitude', 'Longitude',
       'Community Board', 'Community Council ', 'Census Tract', 'BIN', 'BBL',
       'NTA'],
      dtype='object')

In [4]:
volunteer['locality'].isnull().sum()

70
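The same check extends from one column to the whole DataFrame. The following is a minimal sketch on a made-up `toy` DataFrame (the data and the names `toy`, `missing_per_column`, and `missing_share` are hypothetical, not part of the volunteer dataset): `isnull().sum()` gives a count per column, and `isnull().mean()` gives the fraction of rows that are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the volunteer dataset
toy = pd.DataFrame({
    "locality": ["Bronx", np.nan, "Queens", np.nan],
    "hits": [10, 20, np.nan, 40],
    "title": ["a", "b", "c", "d"],
})

# Number of missing values in every column at once
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fraction of rows that are missing, per column
missing_share = toy.isnull().mean()
print(missing_share)
```

Looking at the fraction as well as the raw count can help when deciding whether a column is worth keeping.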

# Dropping missing data

Now that you've explored the `volunteer` dataset and understand its structure and contents, it's time to begin dropping missing values.

In this exercise, you'll drop both columns and rows to create a subset of the `volunteer` dataset.

In [6]:
# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(['Latitude', 'Longitude'], axis=1)

# Drop rows with missing category_desc values from volunteer_cols
volunteer_subset = volunteer_cols.dropna(subset=['category_desc'])

# Print out the shape of the subset
print(volunteer_subset.shape)

(617, 33)
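To see how each step changes the shape, here is a small sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration). It uses the `columns=` keyword of `drop`, which is equivalent to passing `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two mostly empty coordinate columns, one missing label
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "category_desc": ["Health", np.nan, "Education", "Health"],
    "Latitude": [np.nan, np.nan, np.nan, 40.7],
    "Longitude": [np.nan, np.nan, np.nan, -74.0],
})

# Dropping columns reduces the second dimension of the shape
toy_cols = toy.drop(columns=["Latitude", "Longitude"])

# Dropping rows with a missing label reduces the first dimension
toy_subset = toy_cols.dropna(subset=["category_desc"])

print(toy.shape, toy_cols.shape, toy_subset.shape)
```

Neither call modifies `toy` in place; each returns a new DataFrame, which is why the results are assigned to new names.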


# Exploring data types

Taking another look at the dataset of volunteer information from New York City, you want to know which data types you'll be working with as you start to do more preprocessing.

Which data types are present in the `volunteer` dataset?

In [10]:
set(volunteer.dtypes)


{dtype('int64'), dtype('float64'), dtype('O')}
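Beyond listing the distinct types, it can be useful to know how many columns have each type and which columns are non-numeric. A minimal sketch on hypothetical toy data (the column values are invented; the string columns are created with an explicit `object` dtype so the result does not depend on the pandas version):

```python
import pandas as pd

# Hypothetical toy data with integer, float, and object columns
toy = pd.DataFrame({
    "opportunity_id": [1, 2, 3],
    "amsl": [0.5, 1.5, 2.5],
    "title": pd.Series(["a", "b", "c"], dtype="object"),
    "hits": pd.Series(["10", "20", "30"], dtype="object"),
})

# How many columns have each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Split the column names into numeric and non-numeric
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_columns)
print(non_numeric_columns)
```

Here `hits` lands in the non-numeric group even though it holds digits, which is the kind of column a type conversion is meant to fix.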

# Converting a column type

If you take a look at the `volunteer` dataset types, you'll see that the column `hits` is type `object`. If you look at the column's values, however, you'll see that they are all integers. Let's convert that column to type `int`.

In [13]:
# Print the dtype of the hits column
print(volunteer["hits"].dtype)

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype("int")

# Look at the dtypes of the dataset
print(volunteer.dtypes)

int32
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int32
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL 
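`astype("int")` works when every value can be parsed as an integer. As a sketch of what might be done otherwise, the example below uses a hypothetical `hits_raw` Series containing one invalid entry (the values are invented): `astype` raises a `ValueError`, while `pd.to_numeric` with `errors="coerce"` turns the invalid entry into `NaN`.

```python
import pandas as pd

# Hypothetical column stored as strings, with one unparseable entry
hits_raw = pd.Series(["737", "22", "n/a", "62"], dtype="object")

# astype("int") fails as soon as one value cannot be parsed
try:
    hits_raw.astype("int")
    astype_failed = False
except ValueError:
    astype_failed = True
print(astype_failed)

# to_numeric with errors="coerce" replaces unparseable values with NaN,
# so the result is float64 rather than an integer dtype
hits_num = pd.to_numeric(hits_raw, errors="coerce")
print(hits_num)

# The nullable Int64 dtype keeps integers and represents the gap as <NA>
hits_int = hits_num.astype("Int64")
print(hits_int.dtype)
```

Whether coercing to missing is appropriate depends on the data; it hides bad entries, so it is worth counting them first.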

# Class imbalance

In the `volunteer` dataset, you're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, you need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur fewer than 50 times in the `volunteer` dataset?

In [14]:
volunteer['category_desc'].value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64
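Rather than reading the answer off the table, the counts can be filtered directly. The sketch below rebuilds the counts shown above as a plain Series (the names `counts`, `rare`, and `shares` are illustrative) and applies a boolean mask; with the real data, the same mask could be applied to the result of `value_counts()`.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "Strengthening Communities": 307,
    "Helping Neighbors in Need": 119,
    "Education": 92,
    "Health": 52,
    "Environment": 32,
    "Emergency Preparedness": 15,
})

# Keep only the categories that occur fewer than 50 times
rare = counts[counts < 50]
print(rare)

# Each category's share of all labeled rows
shares = counts / counts.sum()
print(shares.round(3))
```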

# Stratified sampling

You now know that the distribution of class labels in the `category_desc` column of the `volunteer` dataset is uneven. If you want to train a model to predict `category_desc`, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this: it splits the data so that each class appears in the training and test sets in roughly the same proportion as in the full dataset.

In [20]:
from sklearn.model_selection import train_test_split
# Create a DataFrame with all columns except category_desc
X = volunteer_subset.drop('category_desc', axis=1)

# Create a category_desc labels dataset
y = volunteer_subset['category_desc']

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Print the category_desc counts from y_train
print(y_train.value_counts())
# y_train

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
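One way to check that a stratified split behaves as intended is to compare class proportions in the two parts. This is a minimal sketch on hypothetical labels (80 "common" and 20 "rare" rows, invented for illustration), using the same `stratify` argument as above with an explicit `test_size`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% "common", 20% "rare"
labels = pd.Series(["common"] * 80 + ["rare"] * 20, name="label")
features = pd.DataFrame({"x": range(100)})

# stratify=labels asks for the class proportions to be preserved in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, stratify=labels, test_size=0.25, random_state=42
)

# Proportions of each class in the training and test sets
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, a random split of a small dataset could leave a rare class underrepresented in, or even absent from, one of the two parts.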
