## Selecting relevant features

In this exercise, you'll identify the redundant columns in the volunteer dataset, and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain related information, so it would make sense to keep only one of the features.

Take some time to examine the features of volunteer in the console, and try to identify the redundant features.

    Create a list of redundant column names and store it in the to_drop variable:
        Out of all the location-related features, keep only postalcode.
        Features that have gone through the feature engineering process are redundant as well.
    Drop the columns in the to_drop list from the dataset.
    Print out the .head() of volunteer_subset to see the selected columns.

In [None]:
# Create a list of redundant column names to drop
to_drop = ["locality", "region", "vol_requests", "category_desc", "created_date"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(columns=to_drop)

# Print out the head of volunteer_subset
print(volunteer_subset.head())
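
As a rough illustration of the same step on data that can be run anywhere, the sketch below builds a tiny stand-in DataFrame. The values and the vol_requests_lognorm column are hypothetical, chosen only to mimic a feature that has already been engineered from vol_requests.

```python
import pandas as pd

# Hypothetical stand-in for the volunteer dataset (all values are made up)
toy = pd.DataFrame({
    "locality": ["Brooklyn", "Queens"],
    "region": ["NYC", "NYC"],
    "postalcode": [11201, 11101],
    "vol_requests": [5, 20],
    "vol_requests_lognorm": [1.609, 2.996],  # assumed engineered version of vol_requests
})

# Keep postalcode as the only location feature and drop the raw vol_requests
to_drop = ["locality", "region", "vol_requests"]

# drop() returns a new DataFrame and leaves the original untouched
toy_subset = toy.drop(columns=to_drop)

print(toy_subset.columns.tolist())
```

Because drop() raises a KeyError for unknown labels by default, a misspelled column name in to_drop fails loudly instead of being silently ignored.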

## Checking for correlated features

You'll now return to the wine dataset, which consists of continuous, numerical features. Compute Pearson's correlation coefficient for each pair of columns to determine which columns are good candidates for elimination. Then, remove those columns from the DataFrame.

    Print out the Pearson correlation coefficients for each pair of features in the wine dataset.
    Drop any columns from wine that have a correlation coefficient above 0.75 with at least two other columns.

In [None]:
# Print out the column correlations of the wine dataset
print(wine.corr())

# Drop the column that is correlated above 0.75 with at least two other columns
wine = wine.drop("Flavanoids", axis=1)

print(wine.head())
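
Reading the correlation matrix by eye works for a handful of columns, but the rule "above 0.75 with at least two other columns" can also be checked programmatically. The following is a minimal sketch on synthetic data. The helper name highly_correlated and the toy columns are hypothetical, and it compares absolute values, which is an assumption that also counts strong negative correlations.

```python
import numpy as np
import pandas as pd


def highly_correlated(df, threshold=0.75, min_partners=2):
    """Return columns whose |r| exceeds `threshold` with at least `min_partners` other columns."""
    corr = df.corr().abs()
    mat = corr.to_numpy(copy=True)
    np.fill_diagonal(mat, 0.0)  # ignore each column's correlation with itself
    partners = pd.Series((mat > threshold).sum(axis=0), index=corr.columns)
    return partners[partners >= min_partners].index.tolist()


# Synthetic data: x and y share a common factor u, and z is their sum,
# so z is strongly correlated with both x and y
rng = np.random.default_rng(0)
u, v, w = rng.normal(size=(3, 500))
toy = pd.DataFrame({"x": u + v, "y": u + w})
toy["z"] = toy["x"] + toy["y"]
toy["noise"] = rng.normal(size=500)

print(highly_correlated(toy))
```

Here only z is flagged: x and y each correlate strongly with z but only moderately with each other, so each has a single partner above the threshold.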

## Exploring text vectors, part 1

Let's expand on the text vector exploration method we just learned about, using the tf-idf vectors of the volunteer dataset's title column. In this first part, we'll extend the function from the slides so that it returns a list of word indices. In the next exercise, we'll write another function that collects the top words across all documents, and then use that list to filter down our text_tfidf vector.

    Add parameters called original_vocab, for the tfidf_vec.vocabulary_, and top_n.
    Call pd.Series() on the zipped dictionary. This will make it easier to operate on.
    Use the .sort_values() method to sort the series in descending order and slice the index up to top_n words.
    Call the function, setting original_vocab=tfidf_vec.vocabulary_, setting vector_index=8 to grab the 9th row, and setting top_n=3, to grab the top 3 weighted words.

In [None]:
# Add in the rest of the arguments
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab=vocab, original_vocab=tfidf_vec.vocabulary_, vector=text_tfidf, vector_index=8, top_n=3))
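
The exercise relies on vocab, tfidf_vec, and text_tfidf already existing in the session. The sketch below reproduces the same idea on three made-up documents. It assumes that vocab is the inverse of tfidf_vec.vocabulary_ (column index to word), which the text above does not state explicitly.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the volunteer titles
docs = ["teach kids coding", "plant trees in the park", "teach adults reading"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# Assumption: vocab maps column index -> word, the inverse of vocabulary_
vocab = {idx: word for word, idx in tfidf_vec.vocabulary_.items()}

# Weights of the non-zero entries in the first document, keyed by word
row = text_tfidf[0]
weights = pd.Series({vocab[i]: w for i, w in zip(row.indices, row.data)})

# Top 2 weighted words, then their column indices in the tf-idf matrix
top_words = weights.sort_values(ascending=False)[:2].index
top_indices = [tfidf_vec.vocabulary_[word] for word in top_words]

print(list(top_words), top_indices)
```

In the first document, "teach" also appears in the third document, so its inverse document frequency is lower and the two rarer words rank above it.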

## Exploring text vectors, part 2

Using the return_weights() function you wrote in the previous exercise, you're now going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

    Call return_weights() to return the top weighted words for that document.
    Call set() on the returned filter_list to remove duplicate word indices.
    Call words_to_filter(), passing in the following arguments: vocab for the vocab parameter, tfidf_vec.vocabulary_ for the original_vocab parameter, text_tfidf for the vector parameter, and 3 for the top_n parameter, to grab the top 3 weighted words from each document.
    Finally, convert the filtered_words set to a list to use as a column filter for the text vector.

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the set of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(filtered_words)]
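
To see what the column filter does, here is a small sketch on a hand-built sparse matrix. The matrix values and the keep set are made up. Sorting the set before indexing is a choice made here, not part of the exercise, so that the column order of the result is predictable.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny sparse matrix: 2 documents x 4 vocabulary columns (values are made up)
matrix = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.8],
    [0.3, 0.0, 0.0, 0.9],
]))

# Column indices to keep, as a set (as returned by a function like words_to_filter)
keep = {3, 1}

# Sets are unordered, so sort to get a reproducible column order
cols = sorted(keep)
filtered = matrix[:, cols]

print(filtered.shape)
print(filtered.toarray())
```

The result stays sparse and keeps every row. Only the columns listed in cols survive, in the order given.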

## Training Naive Bayes with feature selection

You'll now re-run the Naive Bayes text classification model that you ran at the end of Chapter 3 with our selection choices from the previous exercise: the volunteer dataset's title and category_desc columns.

    Use train_test_split() on the filtered_text text vector and the y labels (the category_desc labels), passing y to the stratify parameter, since the class distribution is uneven.
    Fit the nb Naive Bayes model to X_train and y_train.
    Calculate the test set accuracy of nb.

In [None]:
# Split the dataset according to the class distribution of category_desc
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))
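
The effect of stratify is easiest to see on labels with a known imbalance. The sketch below uses made-up data with an 80/20 class split and checks that both the training and test sets keep that ratio. The arrays are illustrative and unrelated to the volunteer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 80 of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# With the default test_size of 0.25, stratify=y keeps the 80/20 ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Share of class 1 in each split
print(y_train.mean(), y_test.mean())
```

Without stratify, a random split of a small, imbalanced dataset can leave a rare class under-represented in the test set, or missing from it altogether, which makes the accuracy estimate less reliable.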

## Using PCA

In this exercise, you'll apply PCA to the wine dataset, to see if you can increase the model's accuracy.

    Instantiate a PCA object.
    Define the features (X) and labels (y) from wine, using the labels in the "Type" column.
    Apply PCA to X_train and X_test, ensuring no data leakage, and store the transformed values as pca_X_train and pca_X_test.
    Print out the .explained_variance_ratio_ attribute of pca to check how much variance is explained by each component.

In [None]:
# Instantiate a PCA object
pca = PCA()

# Define the features and labels from the wine dataset
X = wine.drop("Type", axis=1)
y = wine["Type"]

# Split the data before fitting PCA so the test set does not influence the components
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit PCA on the training features only, then apply the same transformation to the test features
pca_X_train = pca.fit_transform(X_train)
pca_X_test = pca.transform(X_test)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)
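
The per-component ratios are often easier to interpret as a running total. The following sketch uses scikit-learn's bundled load_wine data as a stand-in, which is an assumption: it may not match the wine DataFrame used in this exercise. It computes the cumulative explained variance and the number of components needed to reach 95%.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled wine dataset (assumed, not the exercise's DataFrame)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

pca = PCA()
pca_X_train = pca.fit_transform(X_train)  # learn the components from training data only
pca_X_test = pca.transform(X_test)        # reuse those components on the test data

# Running total of explained variance across components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

print(cumulative)
print(n_components_95)
```

PCA is sensitive to feature scale, so features with large numeric ranges can dominate the first components when the data has not been standardized.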

## Training a model with PCA

Now that you have run PCA on the wine dataset, you'll train a KNN model on the transformed data.

    Fit the knn model to the PCA-transformed features, pca_X_train, and training labels, y_train.
    Print the test set accuracy of the knn model using pca_X_test and y_test.

In [None]:
# Fit knn to the training data
knn.fit(pca_X_train, y_train)

# Score knn on the test data and print it out
print(knn.score(pca_X_test, y_test))
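
One way to keep the PCA and KNN steps together is a scikit-learn pipeline, which fits PCA on the training data and applies the same transformation at scoring time. This is a sketch under assumptions: it uses the bundled load_wine data and a default KNeighborsClassifier, neither of which is confirmed to match the wine DataFrame or the knn object in this exercise.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data: scikit-learn's bundled wine dataset (assumed)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The pipeline fits PCA on X_train only, then trains KNN on the transformed features
model = make_pipeline(PCA(), KNeighborsClassifier())
model.fit(X_train, y_train)

# score() transforms X_test with the already-fitted PCA before predicting
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Wrapping both steps in one estimator removes the need to track pca_X_train and pca_X_test by hand, which makes it harder to fit PCA on the test data by accident.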