<a href="https://colab.research.google.com/github/raulbs7/Machine-Learning-Techniques-Project/blob/master/NLP_Supervised_Project/3_Feature_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3. FEATURE SELECTION

## 3.1 Imports

This section loads the libraries used in this notebook.

In [66]:
import pandas as pd
import numpy as np
import io
import scipy.sparse as sp

from sklearn.feature_selection import SelectKBest, chi2, f_regression, SelectPercentile

from google.colab import files
from google.colab import drive

In [67]:
drive.mount('/drive')

Mounted at /drive


## 3.2 Importing dataset

This phase uses a file called **vectorized_tweets.npz**, which contains the sparse matrix with the vectorization and features of the tweets being studied.

In [24]:
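def upload_matrix():
  # Prompt for a file and load the first upload as a SciPy sparse matrix.
  uploaded = files.upload()
  for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
    # Read from the uploaded bytes: Colab may save the file under a
    # renamed path such as "name (2).npz", so loading by `fn` could
    # pick up an older copy on disk.
    matrix = sp.load_npz(io.BytesIO(uploaded[fn]))
    return matrix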

In [25]:
features = upload_matrix()

Saving vectorized_tweets.npz to vectorized_tweets (2).npz
User uploaded file "vectorized_tweets.npz" with length 3757108 bytes


In [26]:
features

<24783x8494 sparse matrix of type '<class 'numpy.float64'>'
	with 485730 stored elements in Compressed Sparse Row format>

Selecting the most important features also requires labeled data. The labels come from the CSV file **processed_tweets.csv**.

In [54]:
def upload_dataframes (index_fields):
  uploaded = files.upload()
  for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
    df = pd.read_csv(io.StringIO(uploaded[fn].decode('utf-8')), index_col = index_fields)
    return df

In [55]:
tweets = upload_dataframes([])

Saving processed_tweets.csv to processed_tweets.csv
User uploaded file "processed_tweets.csv" with length 1558482 bytes
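
Because the labels and the feature matrix come from two separate files, it can help to confirm that they describe the same number of tweets before selecting features. The following is a minimal sketch of such a check; the helper `check_alignment` and the toy data are hypothetical and are not part of the original notebook.

```python
import pandas as pd
import scipy.sparse as sp

def check_alignment(matrix, labels):
    """Raise if the feature matrix and the labels differ in row count."""
    if matrix.shape[0] != len(labels):
        raise ValueError(
            f"{matrix.shape[0]} feature rows but {len(labels)} labels")
    return True

# Hypothetical toy data standing in for `features` and `tweets`.
toy_features = sp.random(6, 4, density=0.5, format="csr", random_state=0)
toy_tweets = pd.DataFrame({"class": [0, 1, 2, 1, 0, 2]})

aligned = check_alignment(toy_features, toy_tweets["class"])
```

In the notebook itself, the equivalent call would be `check_alignment(features, tweets['class'])`.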


## 3.3 Selection of the best features

Only 30% of the features are kept and the remaining 70% are removed, so the first step is to compute how many features that 30% represents.

The scoring function used to select the features is the default one for `SelectKBest` (**f_classif**), which computes the ANOVA F-value between each feature and the class label.

In [30]:
cols_features = features.shape[1]
cols_features

8494

In [33]:
new_cols_features = int(cols_features * 0.3)
new_cols_features

2548

The matrix is going to be reduced to **2548** columns.
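
The arithmetic behind this number can be sketched as follows. The helper name `n_features_to_keep` is hypothetical; the point is that `int()` truncates, so 8494 × 0.3 = 2548.2 becomes 2548.

```python
def n_features_to_keep(n_columns, fraction):
    """Number of columns kept when retaining `fraction` of them.

    int() truncates toward zero, so a fractional result is rounded down.
    """
    return int(n_columns * fraction)

kept = n_features_to_keep(8494, 0.3)  # 8494 * 0.3 = 2548.2, truncated
removed = 8494 - kept                 # columns that will be dropped
```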

---
The selection requires labeled data. These labels are the values of the **class** column of **processed_tweets.csv**.



In [63]:
selector = SelectKBest(k=new_cols_features)
new_features = selector.fit_transform(features, tweets['class'])
new_features

<24783x2548 sparse matrix of type '<class 'numpy.float64'>'
	with 381643 stored elements in Compressed Sparse Row format>
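
To see which columns survive the selection, the fitted selector can be inspected. The sketch below uses a small random sparse matrix and random labels (both hypothetical stand-ins for `features` and `tweets['class']`) and passes `f_classif` explicitly, which matches the default used above.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical toy data: 200 samples, 50 sparse features, 3 classes.
rng = np.random.default_rng(0)
toy_X = sp.random(200, 50, density=0.3, format="csr", random_state=0)
toy_y = rng.integers(0, 3, size=200)

toy_k = 15  # keep 30% of the 50 toy columns
toy_selector = SelectKBest(score_func=f_classif, k=toy_k)
toy_X_new = toy_selector.fit_transform(toy_X, toy_y)

# Indices of the retained columns and their F-scores.
kept_columns = toy_selector.get_support(indices=True)
kept_scores = toy_selector.scores_[kept_columns]
```

Applied to the notebook's `selector`, `get_support(indices=True)` would return the indices of the 2548 retained columns, which can be mapped back to the vocabulary if the vectorizer from the previous step is available.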

## 3.4 Exporting data

As with the vectorization step, the sparse matrix of selected features is saved to another file, **selected_features.npz**. The reason is the size of the data: stored as CSV, these files take about 200 MB, whereas the .npz files only reach about 4 MB, so they are much lighter to pass between processing steps.

In [69]:
sp.save_npz('/drive/My Drive/selected_features.npz', new_features)
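
A quick way to gain confidence in the export is to reload the file and compare it with the matrix in memory. The sketch below does this round trip with a toy matrix in a temporary directory; the file name `toy_selected_features.npz` and the toy data are hypothetical.

```python
import os
import tempfile
import scipy.sparse as sp

# Hypothetical toy matrix standing in for `new_features`.
toy_matrix = sp.random(100, 20, density=0.1, format="csr", random_state=0)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_selected_features.npz")
    sp.save_npz(path, toy_matrix)   # write the compressed sparse matrix
    reloaded = sp.load_npz(path)    # read it back

# The round trip should preserve both the shape and every stored value.
same_shape = reloaded.shape == toy_matrix.shape
same_values = (reloaded != toy_matrix).nnz == 0
```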