# Removing Duplicative and Sparse Data

The most basic task in data cleaning is detecting and removing erroneous data. This includes duplicate data as well as missing or unreliable data. It is not the most glamorous task, but it is enormously important. As the old adage goes, "garbage in, garbage out." Being able to wrangle and clean messy datasets is essential to success and can set you apart from others in the data science/engineering field.

To get started, create this dataframe of weather data. 

In [1]:
import pandas as pd

df = pd.DataFrame({
    "record_id" : ['DCMXP87EDE', 'DCMXP87EDE', 'ZMIFM3HX9G', 'HIVVXBAPS2', 'U1AA66UDES', 'B20KL5PW3L', 'FIZLY34KSQ'],
    "rain_inches" : [1.1, 1.1, 0.0, 0.0, 2.4, 11.2, 3.2],
    "tornado" : [0,0,1,0,0,0,0],
    "lightning" :[0,0,1,1,1,0,0],
    "wind_speed_mph" : [3.1, 3.1, 143.0, None, 8.1, 5.0, None],
    "severity" : ['CLEAR', 'CLEAR', 'SEVERE', 'MINOR', 'MINOR', 'MAJOR', None],
    "transmit_ind" :[1,1,1,1,1,1,1]
})

df

Unnamed: 0,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity,transmit_ind
0,DCMXP87EDE,1.1,0,0,3.1,CLEAR,1
1,DCMXP87EDE,1.1,0,0,3.1,CLEAR,1
2,ZMIFM3HX9G,0.0,1,1,143.0,SEVERE,1
3,HIVVXBAPS2,0.0,0,1,,MINOR,1
4,U1AA66UDES,2.4,0,1,8.1,MINOR,1
5,B20KL5PW3L,11.2,0,0,5.0,MAJOR,1
6,FIZLY34KSQ,3.2,0,0,,,1


## Where Did the Data Come From? 

You may be tempted to dive right into writing Python code and wrangling datasets in Pandas dataframes, but let's step back for a brief moment and ask some questions. Where did this data come from? How was it collected? What sensors or data entry methods were used to collect it? Could the data be biased in any way or missing important variables? 

Asking where the data came from is just as important as asking what it says, if not more so. This could reveal larger issues that are dirtying your data but are not detectable by looking at the dataset alone. The data could be biased, or missing relevant records or variables for the problem being solved. If you have data that is full of empty values (which we will discuss techniques for removing), you should fully understand why they are empty and whether there is a deeper problem in the process producing the data. For example, if a broken temperature sensor is recording `NA` or `NaN` values at a specific weather station, you should fix that sensor rather than just removing those records. Likewise, if a station is producing duplicate records, the software bug should be fixed rather than just removing the duplicates.

There are some things you cannot quantify or apply a Pandas function to fix, and you must apply qualitative judgment to ask the right questions and address problems at the source. Once you have exhausted those questions and fully understand your dataset, then you can proceed accordingly. 
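
Before asking those questions, it can help to see where the gaps actually are. The following is a minimal sketch, assuming a hypothetical `readings` dataframe with made-up `station` and `temp_f` columns (not the tutorial's dataset), that counts missing values per column and per source so you know which station to investigate.

```python
import pandas as pd

# Hypothetical station readings, invented for illustration only
readings = pd.DataFrame({
    "station": ["A", "A", "B", "B", "B"],
    "temp_f": [71.2, None, None, None, 65.0],
})

# Count missing values in each column
missing_per_column = readings.isna().sum()

# Count missing temperature readings per station to see if one source dominates
missing_by_station = readings["temp_f"].isna().groupby(readings["station"]).sum()

print(missing_per_column)
print(missing_by_station)
```

If one station accounts for most of the missing readings, that points to a problem at the source rather than something to quietly delete.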

## Removing Duplicate Rows 

Let's print our dataframe of weather data. 

In [6]:
df

Unnamed: 0,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity,transmit_ind
0,DCMXP87EDE,1.1,0,0,3.1,CLEAR,1
1,DCMXP87EDE,1.1,0,0,3.1,CLEAR,1
2,ZMIFM3HX9G,0.0,1,1,143.0,SEVERE,1
3,HIVVXBAPS2,0.0,0,1,,MINOR,1
4,U1AA66UDES,2.4,0,1,8.1,MINOR,1
5,B20KL5PW3L,11.2,0,0,5.0,MAJOR,1
6,FIZLY34KSQ,3.2,0,0,,,1


Notice above that we have some questionable data: the top two rows are duplicates, and some values are missing (`NaN` or `None`). Let's focus on duplicates first.

To flag every duplicate row except its first occurrence, use the `duplicated()` function.

In [8]:
df.duplicated()

0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool

You can flag all instances (including the first found instance) by setting `keep=False`.

In [13]:
df.duplicated(keep=False)

0     True
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool
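
To make the effect of `keep` concrete, here is a small sketch on a made-up four-row dataframe (the `demo` name and its values are assumptions for illustration). It compares the three accepted settings, `'first'` (the default), `'last'`, and `False`, and counts the redundant rows.

```python
import pandas as pd

# Hypothetical mini-dataset with one repeated row
demo = pd.DataFrame({
    "record_id": ["A1", "A1", "B2", "C3"],
    "rain_inches": [1.1, 1.1, 0.0, 2.4],
})

first_kept = demo.duplicated()               # default keep='first'
last_kept = demo.duplicated(keep="last")     # flag all but the last occurrence
all_flagged = demo.duplicated(keep=False)    # flag every occurrence

print(first_kept.tolist())     # [False, True, False, False]
print(last_kept.tolist())      # [True, False, False, False]
print(all_flagged.tolist())    # [True, True, False, False]
print(int(first_kept.sum()))   # 1 redundant row
```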

If you want to find duplicates based on one or more columns as the key, use the `subset` parameter. Below we find duplicate records using only the `record_id` field.

In [16]:
df.duplicated(subset=['record_id'])

0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool

We could also build a composite key from multiple fields if we wished, such as `record_id` and `rain_inches`.

In [19]:
df.duplicated(subset=['record_id','rain_inches'])

0    False
1     True
2    False
3    False
4    False
5    False
6    False
dtype: bool

We could use the boolean `Series` returned in the examples above to filter the dataframe ourselves. However, the `drop_duplicates()` function does this for us. It accepts the same arguments as `duplicated()` and has an `inplace` parameter for modifying the existing dataframe.

In [22]:
df.drop_duplicates(inplace=True)
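
The boolean `Series` approach mentioned above is still handy when you want to look at the duplicates before removing them. Here is a minimal sketch, again on a hypothetical `demo` dataframe rather than the weather data, showing that filtering with the negated mask gives the same rows as `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical mini-dataset with one repeated row
demo = pd.DataFrame({
    "record_id": ["A1", "A1", "B2", "C3"],
    "rain_inches": [1.1, 1.1, 0.0, 2.4],
})

# Boolean mask marking every row that belongs to a duplicate group
mask = demo.duplicated(keep=False)

# Inspect the duplicated rows before removing anything
duplicate_rows = demo[mask]
print(duplicate_rows)

# Keep only rows that are not repeats of an earlier row
deduped = demo[~demo.duplicated()]

# Without inplace=True, drop_duplicates() returns a new dataframe
print(deduped.equals(demo.drop_duplicates()))
```

Leaving out `inplace=True` and assigning the result to a new variable keeps the original dataframe available in case you need to compare.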

And of course, you can always drop based on a subset. 

In [25]:
df.drop_duplicates(subset=['record_id'], inplace=True)
df

Unnamed: 0,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity,transmit_ind
0,DCMXP87EDE,1.1,0,0,3.1,CLEAR,1
2,ZMIFM3HX9G,0.0,1,1,143.0,SEVERE,1
3,HIVVXBAPS2,0.0,0,1,,MINOR,1
4,U1AA66UDES,2.4,0,1,8.1,MINOR,1
5,B20KL5PW3L,11.2,0,0,5.0,MAJOR,1
6,FIZLY34KSQ,3.2,0,0,,,1


> Note there are nearly identical functions to handle duplicates for [Index](https://pandas.pydata.org/docs/reference/api/pandas.Index.duplicated.html#pandas.Index.duplicated) and [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.duplicated.html#pandas.Series.duplicated), also called `duplicated()` and `drop_duplicates()`. They operate much in the same way as the dataframe counterpart for these functions. 
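
As a quick sketch of those counterparts, the snippet below applies `duplicated()` and `drop_duplicates()` to a small `Series` and `Index`. The values are invented for illustration.

```python
import pandas as pd

# Series version
ids = pd.Series(["A1", "A1", "B2", "C3"], name="record_id")
print(ids.duplicated().tolist())        # [False, True, False, False]
print(ids.drop_duplicates().tolist())   # ['A1', 'B2', 'C3']

# Index version
idx = pd.Index(["x", "y", "x"])
print(idx.duplicated().tolist())        # [False, False, True]
print(idx.drop_duplicates().tolist())   # ['x', 'y']
```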

## Remove Columns with One Value

Columns that have a single value are unlikely to be useful for machine learning and other analysis. They are therefore candidates for removal, as long as the single value is not itself the result of an error. Notice how `transmit_ind` is all 1's, which is not helpful.

In [30]:
df

Unnamed: 0,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity,transmit_ind
0,DCMXP87EDE,1.1,0,0,3.1,CLEAR,1
2,ZMIFM3HX9G,0.0,1,1,143.0,SEVERE,1
3,HIVVXBAPS2,0.0,0,1,,MINOR,1
4,U1AA66UDES,2.4,0,1,8.1,MINOR,1
5,B20KL5PW3L,11.2,0,0,5.0,MAJOR,1
6,FIZLY34KSQ,3.2,0,0,,,1


We can use the `nunique()` function to identify the number of unique values in each column as a series.

In [33]:
df.nunique()

record_id         6
rain_inches       5
tornado           2
lightning         2
wind_speed_mph    4
severity          4
transmit_ind      1
dtype: int64

We can iterate over the series above and collect the labels of the columns to delete, based on whether they have only one unique value.

In [36]:
# identify single-value columns to delete
delete_cols = [c for c,v in zip(df.columns, df.nunique()) if v == 1]
delete_cols

['transmit_ind']

Finally, we can remove those columns (there will only be one in this case) by passing them to the `drop()` function. Make sure to indicate that we are dropping columns by specifying `axis=1`.

In [39]:
df.drop(delete_cols, axis=1, inplace=True)
df

Unnamed: 0,record_id,rain_inches,tornado,lightning,wind_speed_mph,severity
0,DCMXP87EDE,1.1,0,0,3.1,CLEAR
2,ZMIFM3HX9G,0.0,1,1,143.0,SEVERE
3,HIVVXBAPS2,0.0,0,1,,MINOR
4,U1AA66UDES,2.4,0,1,8.1,MINOR
5,B20KL5PW3L,11.2,0,0,5.0,MAJOR
6,FIZLY34KSQ,3.2,0,0,,
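
The same steps can be written without an explicit loop by filtering the `nunique()` result directly. This is a sketch on a hypothetical `demo` dataframe; the `sensor_flag` column is invented to show that `nunique()` ignores `NaN` by default, so a column that is constant apart from missing values also counts as single-valued unless you pass `dropna=False`.

```python
import pandas as pd

demo = pd.DataFrame({
    "rain_inches": [1.1, 0.0, 2.4],
    "transmit_ind": [1, 1, 1],
    "sensor_flag": [1, None, 1],   # hypothetical: constant apart from a missing value
})

# NaN is ignored by default (dropna=True)
counts = demo.nunique()
constant_cols = counts[counts == 1].index.tolist()
print(constant_cols)               # ['transmit_ind', 'sensor_flag']

# Count NaN as its own distinct value
counts_with_nan = demo.nunique(dropna=False)
strictly_constant = counts_with_nan[counts_with_nan == 1].index.tolist()
print(strictly_constant)           # ['transmit_ind']

# drop(columns=...) returns a new dataframe; demo itself is unchanged
trimmed = demo.drop(columns=constant_cols)
print(trimmed.columns.tolist())    # ['rain_inches']
```

Whether a column like `sensor_flag` should be dropped depends on whether the missing values carry meaning, which goes back to understanding where the data came from.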


## Remove Columns with Few Values

When dealing with categorical values, it should be unsurprising that there are few values. In our weather data, we only expect two values (`0` or `1`) for boolean fields such as `tornado` and `lightning`. We only expect 4 or so possible values for `severity`, such as `MAJOR`, `MINOR`, `CLEAR`, and `SEVERE`. We should rarely consider discrete variables like these to be too sparse to use.

However, dealing with numerical/continuous values (decimals) is a different story. When a numerical column has few values, there may not be enough variance to make meaningful model predictions. If this is indeed the case, the column should be removed. This is not always the case, so only remove a column when you are certain it does not add value. Sometimes the choice of model will impact this decision as well, as linear models often depend on some variance for a meaningful distribution of the data.

Let's bring in a different dataset for this example: a wine quality dataset with different wine attributes. 

In [43]:
wine_df = pd.read_csv('https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/regression/winequality-red.csv')
wine_df

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


One metric that might guide us to columns with low numbers of unique values is, for each column, the proportion of unique values out of all rows. Below we take each column, and divide the number of unique values by the number of rows. 

In [45]:
n_rows, n_cols = wine_df.shape

for i in range(n_cols):
    unique_num = wine_df.iloc[:, i].nunique()
    percentage = float(unique_num) / n_rows * 100 
    print(f'{i}, {unique_num}, {round(percentage,2)}%')

0, 96, 6.0%
1, 143, 8.94%
2, 80, 5.0%
3, 91, 5.69%
4, 153, 9.57%
5, 60, 3.75%
6, 144, 9.01%
7, 436, 27.27%
8, 89, 5.57%
9, 96, 6.0%
10, 65, 4.07%
11, 6, 0.38%


As you can see above, there are some columns with very low percentages of unique values. The categorical ones are to be expected, like the last column `quality` (at position 11). But some columns like `alcohol` (at position 10) and `free_sulfur_dioxide` (at position 5) are really low.

Let's say we wanted to remove columns with 5% or less unique values. Let's adapt our `for` loop above to extract column labels that have a percentage of unique values of `.05` or less. 

In [49]:
delete_cols = []

n_rows, n_cols = wine_df.shape

for i in range(n_cols):
    unique_num = wine_df.iloc[:, i].nunique()
    percentage = float(unique_num) / n_rows  
    if percentage <= .05:
        delete_cols.append(wine_df.columns[i])
    
delete_cols

['free_sulfur_dioxide', 'alcohol', 'quality']
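
The loop above can also be expressed as a single vectorized expression, since `nunique()` returns a `Series` that can be divided by the row count. The helper name `low_unique_ratio_columns` and the 100-row `demo` dataframe below are hypothetical, chosen only to sketch the idea.

```python
import pandas as pd

def low_unique_ratio_columns(frame, threshold=0.05):
    """Return labels of columns whose unique-value ratio is at or below threshold."""
    ratios = frame.nunique() / len(frame)
    return ratios[ratios <= threshold].index.tolist()

# Hypothetical 100-row frame: 'grade' has only 4 distinct values
demo = pd.DataFrame({
    "id": range(100),
    "grade": [i % 4 for i in range(100)],
    "reading": [i / 2 for i in range(100)],
})

print(low_unique_ratio_columns(demo))         # ['grade']
print(low_unique_ratio_columns(demo, 0.01))   # []
```

Returning the labels rather than dropping inside the function leaves the decision about which columns to remove with the caller.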

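We then drop those three columns, and you will notice they no longer appear in the dataframe. Keep in mind that `quality` is the column we treat as the output later on, so in practice you would likely exclude it from this check.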

In [52]:
wine_df.drop(delete_cols, axis=1, inplace=True)
wine_df

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,total_sulfur_dioxide,density,ph,sulphates
0,7.4,0.700,0.00,1.9,0.076,34.0,0.99780,3.51,0.56
1,7.8,0.880,0.00,2.6,0.098,67.0,0.99680,3.20,0.68
2,7.8,0.760,0.04,2.3,0.092,54.0,0.99700,3.26,0.65
3,11.2,0.280,0.56,1.9,0.075,60.0,0.99800,3.16,0.58
4,7.4,0.700,0.00,1.9,0.076,34.0,0.99780,3.51,0.56
...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,44.0,0.99490,3.45,0.58
1595,5.9,0.550,0.10,2.2,0.062,51.0,0.99512,3.52,0.76
1596,6.3,0.510,0.13,2.3,0.076,40.0,0.99574,3.42,0.75
1597,5.9,0.645,0.12,2.0,0.075,44.0,0.99547,3.57,0.71


## Remove Columns with Low Variance

Another way to approach this problem of columns with few unique values is to calculate the variance and use that as a cutoff threshold. Recall that variance $ \sigma^2 $ averages the squared differences between each observed value $ x_i $ and the mean $ \mu $ of those values. In other words, to calculate variance, square the difference between each data point $ x_i $ and the mean $ \mu $, sum the results, and divide by the number of elements $ n $.

$$
\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \mu)^2} {n}
$$ 
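
As a worked check of the formula, the sketch below computes the variance of a small made-up array by hand and compares it with NumPy and pandas. One detail worth knowing: `np.var()` divides by $ n $ by default, while the pandas `var()` method divides by $ n - 1 $ unless you pass `ddof=0`.

```python
import numpy as np
import pandas as pd

# Made-up sample values with a mean of exactly 5
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Apply the formula directly: mean of squared deviations from the mean
mu = x.mean()
population_var = ((x - mu) ** 2).sum() / len(x)
print(population_var)   # 4.0

# NumPy divides by n by default, matching the formula
numpy_var = np.var(x)
print(numpy_var)        # 4.0

# pandas matches the formula only with ddof=0
pandas_population_var = pd.Series(x).var(ddof=0)
pandas_sample_var = pd.Series(x).var()   # default ddof=1 divides by n - 1
print(pandas_population_var, pandas_sample_var)
```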

Let's load our wine dataset again to start over and bring those removed columns back. 

In [56]:
wine_df = pd.read_csv('https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/regression/winequality-red.csv')
wine_df

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


Low variance means the values in a column sit close to their mean, which often (though not always) goes along with having few distinct values. Keep in mind that variance also depends on the scale of the column. There is a helpful utility `VarianceThreshold` in scikit-learn that can be used to remove features based on variance. Typically, we want more variance for modeling purposes in statistics and machine learning, as a feature with too little variance is not going to be useful. Let's declare an instance of `VarianceThreshold` here and set its threshold to `.05`. The higher this parameter is, the more columns it will eliminate, because the cutoff for variance is higher.

In [59]:
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=.05)

Next let's extract just the input variable columns by selecting all but the last column (which is `quality`). Then we pass them to the `VarianceThreshold`'s `fit_transform()` function to get the columns of data that met that threshold.

In [61]:
X = wine_df.iloc[:,:-1]
X_threshold = vt.fit_transform(X)
X_threshold

array([[ 7.4,  1.9, 11. , 34. ,  9.4],
       [ 7.8,  2.6, 25. , 67. ,  9.8],
       [ 7.8,  2.3, 15. , 54. ,  9.8],
       ...,
       [ 6.3,  2.3, 29. , 40. , 11. ],
       [ 5.9,  2. , 32. , 44. , 10.2],
       [ 6. ,  3.6, 18. , 42. , 11. ]])

So how many columns made it through and met that variance threshold? Let's take a look at the shape and count the number of columns before and after the transformation.

In [63]:
print(f"NUM FEATURES BEFORE: {X.shape[1]}")
print(f"NUM FEATURES AFTER: {X_threshold.shape[1]}")

NUM FEATURES BEFORE: 11
NUM FEATURES AFTER: 5


So 6 columns were eliminated. Unfortunately, in this transformation our `DataFrame` was turned into a NumPy `ndarray`. Thankfully, there is a `get_support()` function on the `VarianceThreshold` that returns the indices of the columns that pass the cutoff. We can pass those indices to the `columns` property to get the column labels, and then use those labels to select the columns from our dataframe.

In [65]:
wine_df[wine_df.columns[vt.get_support(indices=True)]]

Unnamed: 0,fixed_acidity,residual_sugar,free_sulfur_dioxide,total_sulfur_dioxide,alcohol
0,7.4,1.9,11.0,34.0,9.4
1,7.8,2.6,25.0,67.0,9.8
2,7.8,2.3,15.0,54.0,9.8
3,11.2,1.9,17.0,60.0,9.8
4,7.4,1.9,11.0,34.0,9.4
...,...,...,...,...,...
1594,6.2,2.0,32.0,44.0,10.5
1595,5.9,2.2,39.0,51.0,11.2
1596,6.3,2.3,29.0,40.0,11.0
1597,5.9,2.0,32.0,44.0,10.2


As you can see, all but 5 of those columns have been eliminated and did not pass the variance threshold. 
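
To see why a particular column was dropped, it can help to look at the variances the selector computed. The sketch below uses a hypothetical `X_demo` dataframe with invented column names and values. It reads the fitted `variances_` attribute and uses the boolean mask returned by `get_support()` (its default form) to keep the result as a dataframe.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: 'density_like' has all-distinct values but a tiny scale
X_demo = pd.DataFrame({
    "density_like": [0.9951, 0.9962, 0.9978, 0.9984, 0.9990],
    "sugar_like": [1.9, 2.6, 2.3, 6.1, 3.8],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
})

selector = VarianceThreshold(threshold=0.05)
selector.fit(X_demo)

# Variance computed for each column during fit
variances = pd.Series(selector.variances_, index=X_demo.columns)
print(variances)

# Boolean mask of surviving columns, applied to the original dataframe
kept = X_demo.loc[:, selector.get_support()]
print(kept.columns.tolist())   # ['sugar_like']
```

Note that `density_like` is removed even though every value is distinct, because its values span a very narrow range. Variance is scale-dependent, so a fixed threshold treats columns in different units differently.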

## Exercise

Below is a dataframe of thermostat data. Write code to remove duplicate records and any columns with 3 or fewer unique values. A completed solution follows.

In [72]:
import pandas as pd

df = pd.DataFrame({
    "record_id" : ['OVUTJE','OVUTJE','WI4QEX','WI4QEX','FS40NF','O64LIT','U888EA'],
    "temperature" : [65.2, 65.2, 47.2, 47.2, 57.4, 23.4, 27.5], 
    "humidity" : [.8, .8, .7, .7, .7, .7, .8],
    "stable" : [True, True, True, True, True, True, True]
})

# drop duplicates
df.drop_duplicates(inplace=True)

# remove columns with 3 or less unique values
delete_cols = []

n_rows, n_cols = df.shape

for i in range(n_cols):
    unique_num = df.iloc[:, i].nunique()
    if unique_num <= 3:
        delete_cols.append(df.columns[i])
    
df.drop(delete_cols, axis=1, inplace=True)
df

Unnamed: 0,record_id,temperature
0,OVUTJE,65.2
2,WI4QEX,47.2
4,FS40NF,57.4
5,O64LIT,23.4
6,U888EA,27.5
