## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import datetime
import warnings

warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

%matplotlib inline

pd.set_option('display.max_columns', None)

Uploading files to Google Colab.

In [2]:
from google.colab import files
uploaded = files.upload()

ModuleNotFoundError: No module named 'google'

## Importing data

In [None]:
data = pd.read_csv('/content/unit4.csv') 

In [None]:
data.head()

In [None]:
data.shape

## Checking data types

In [None]:
data.info()

## Checking for null values

In [None]:
nulls = pd.DataFrame(data.isna().sum() * 100 / len(data))  # percentage of nulls per column
nulls = nulls.reset_index()
nulls.columns = ['column_name', 'Percentage Null Values']
nulls.sort_values(by='Percentage Null Values', ascending=False)

## Checking the numerical values

In [None]:
numericals = data.select_dtypes(np.number)
numericals.head()

*   The INCOME might be an important factor in predicting the gift value, so even though it has a lot of null values, we will not drop the column.

*   In this exercise, we will try a more precise method to replace the null values, instead of simply replacing them by a constant value, mean or median.

*   We will use a similar method for the column TIMELAG.

## Checking the **income's** histogram

In [None]:
sns.histplot(data['INCOME'])

# Activity: Dealing with missing values.


Possible approaches:

**Drop:**

Let's consider the **gender** column. 

*   Can we somehow guess the missing gender? **NO**.
*   Can this column have any possible value compatible with a missing value? **NO**

Therefore, we are forced to drop the corresponding rows.

In [None]:
data['GENDER'].value_counts()

In [None]:
to_drop = data[~data['GENDER'].isin(['F','M'])].index.tolist()
data.drop(to_drop, inplace = True)
data = data.reset_index(drop=True)  # reset_index returns a new DataFrame, so assign it back

In [None]:
data['GENDER'].value_counts()

**Replace:**

We can replace missing values if we have some other information that tells us we can do this, even if it is not the missing information itself.

For example, if the data follow an approximately normal distribution, we might want to substitute the missing values with the mean. You always need to have something that "tells you" that you can replace the data.

Let's consider column **HOMEOWNR**. 

*   Can we guess the value? **NO**
*   Can this column have any possible value compatible with a missing value? **YES**: 'unknown'.

Therefore, we can replace the value in this column by **'U'** for 'unknown'.

In [None]:
data['HOMEOWNR'].value_counts()

In [None]:
np.unique(data['HOMEOWNR']).tolist()
# Replace only the blank entries with 'U'; keep every other value as it is
data['HOMEOWNR'] = np.where(data['HOMEOWNR'] == ' ', 'U', data['HOMEOWNR'])

In [None]:
data.head()

## Interpolation

Let's see which kind of interpolation is best for filling the missing values of the **'INCOME'** column.

First **LOOK AT YOUR DATA!!!**

In [None]:
data[['INCOME']].head()

In [None]:
sns.histplot(data['INCOME'])

Let's try first with linear interpolation

In [None]:
new_income_data_linear = data['INCOME'].interpolate(method='linear')
sns.histplot(new_income_data_linear)

Akima's interpolation

In [None]:
new_income_data_akima = data['INCOME'].interpolate(method='akima')
sns.histplot(new_income_data_akima)

Polynomial order 3.

In [None]:
new_income_data_poly = data['INCOME'].interpolate(method='polynomial', order=3)
sns.histplot(new_income_data_poly)

Imputing with the mean

In [None]:
# Comparing the interpolation methods with imputing the mean
points2 = data['INCOME'].fillna(data['INCOME'].mean())
sns.histplot(points2)

Does it make sense at all?
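To make the question concrete, here is a small sketch on a made-up series (the values are illustrative assumptions, not taken from the dataset). Linear interpolation fills each gap from the rows directly around it, so the result depends on row order, while filling with the mean gives every gap the same value.

```python
import numpy as np
import pandas as pd

# Toy income codes with gaps (hypothetical values, not from the dataset)
income = pd.Series([1.0, np.nan, 5.0, np.nan, np.nan, 2.0])

# Linear interpolation: each gap is filled from the known rows around it
linear = income.interpolate(method='linear')

# Mean imputation: every gap gets the same value, whatever the row order
mean_filled = income.fillna(income.mean())

print(linear.tolist())
print(mean_filled.round(2).tolist())
```

If the rows have no meaningful order, the neighbouring rows carry no information about the gap, which is worth keeping in mind when answering the question above.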

# Activity: Using linear regression to impute missing values.

You already know how to predict a numerical amount. Therefore, you can use other columns to predict the missing values of the column of interest. Use the 'HV1' and 'IC1' columns to predict the missing values of 'INCOME'.

**Hint**: For the sake of simplicity, treat the rows with NaNs as if they were a test set.
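The idea in the hint can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the values are hypothetical, and the fit uses `np.linalg.lstsq` (ordinary least squares, the same kind of fit a linear regression performs) so that the example depends only on NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): where known, INCOME is exactly 1 + 2*HV1 + 3*IC1
toy = pd.DataFrame({
    'HV1':    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'IC1':    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    'INCOME': [9.0, 8.0, 19.0, np.nan, 29.0, np.nan],
})

known = toy['INCOME'].notna()  # rows with INCOME act as the training set

def design(df):
    """Feature matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(df)), df[['HV1', 'IC1']].to_numpy()])

# Least squares fit on the rows where INCOME is known
coef, *_ = np.linalg.lstsq(design(toy[known]),
                           toy.loc[known, 'INCOME'].to_numpy(),
                           rcond=None)

# Predict for the rows with missing INCOME (the "test set") and
# write the rounded predictions back with .loc
toy.loc[~known, 'INCOME'] = np.round(design(toy[~known]) @ coef)
print(toy)
```

Assigning through `.loc` with a boolean mask matters here: chained indexing such as `toy[mask]['INCOME'] = ...` would write into a temporary copy and leave the original table unchanged.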

In [None]:
data.shape

In [None]:
data['INCOME'].isna().sum()

In [None]:
data['INCOME'].value_counts(dropna=False)

In [None]:
data.columns

In [None]:
pd.__version__

In [None]:
np.__version__

In [None]:
from sklearn.linear_model import LinearRegression

# Rows where INCOME is known act as the training set
income_known = data['INCOME'].notna()
X = data.loc[income_known, ['HV1', 'IC1']]
y = data.loc[income_known, 'INCOME']

# Rows where INCOME is missing act as the "test set"
X_nulls = data.loc[~income_known, ['HV1', 'IC1']]

model = LinearRegression().fit(X, y)
income_pred = model.predict(X_nulls)

# Income values are integers, therefore we need to round!
# Assign with .loc: chained indexing (data[mask]['INCOME'] = ...) only modifies a copy.
data.loc[~income_known, 'INCOME'] = np.around(income_pred, 0)
data.head()

Let's explore now the column 'TIMELAG'

In [None]:
sns.histplot(data['TIMELAG'])

In [None]:
sns.boxplot(x=data['TIMELAG'])

In [None]:
ax = sns.distplot(data['TIMELAG'])
ax2 = ax.twinx()
sns.boxplot(x=data['TIMELAG'], ax=ax2)
ax2.set(ylim=(-.5, 10))

Let's try some transformations to see if we can improve the distribution.

In [None]:
def log_transfom_clean_(x):
    if np.isfinite(x) and x > 0: # The log is only defined for finite, strictly positive values
        return np.log(x)
    else:
        return np.nan # We are returning NaNs so that we can replace them later

def sqrt_transfom_clean_(x):
    if np.isfinite(x) and x>=0:
        return np.sqrt(x)
    else:
        return np.nan # We are returning NaNs so that we can replace them later

In [None]:
# Using the functions to check the distribution of transformed data
pd.Series(map(log_transfom_clean_, data['TIMELAG'])).hist()
plt.show()

pd.Series(map(sqrt_transfom_clean_, data['TIMELAG'])).hist()
plt.show()

As can be seen in the figures, the logarithmic transformation works better than the square root.

This could be expected given the extreme skewness of the data.

We could also use the Box-Cox transformation, but the resulting distribution would probably be similar (although not identical, and possibly even better). However, this is an illustrative example of how to proceed.

Let's assume that we don't know about Box-Cox and we want to apply the logarithmic transformation to the **'TIMELAG'** column.

In [None]:
data['TIMELAG'] = list(map(log_transfom_clean_, data['TIMELAG']))

Remember that our function ignored the '0' and infinite values and returned NaN for them. We may want to replace them with the mean of the NEW distribution.

In [None]:
data['TIMELAG'] = data['TIMELAG'].fillna(np.mean(data['TIMELAG']))
sns.distplot(data['TIMELAG'])
plt.show()

It's not perfectly Gaussian but we improved it a lot.
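The transform-then-fill steps above can also be written without an element-wise function. The following is a minimal sketch on made-up values (the series is a hypothetical stand-in for the 'TIMELAG' column), not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a zero and a NaN mixed with valid entries
timelag = pd.Series([1.0, 0.0, np.e, np.nan, np.e ** 2])

# Keep only finite, strictly positive values; the rest become NaN
valid = np.isfinite(timelag) & (timelag > 0)
logged = np.log(timelag.where(valid))

# Fill the gaps with the mean of the transformed values
filled = logged.fillna(logged.mean())
print(filled.tolist())
```

Masking before calling `np.log` avoids taking the logarithm of zero, so no infinite values or runtime warnings are produced.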

# Activity: Logarithmic transformation.

A logarithmic scale is commonly used to visualize exponential data: the logarithm is the inverse of the exponential function, so exponential growth appears as a straight line. This is needed because we cannot visualize exponential functions properly otherwise. As an example, you can see some coronavirus visualizations, like [this one](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/4.1-COVID-Logarithmicvslinear.png). Check the log transform with the ICn columns.
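A quick numerical check of this claim, using a made-up series that doubles at every step (the numbers are assumptions chosen for illustration): on the raw scale the increments keep growing, while on the log scale every step adds the same amount.

```python
import numpy as np

# Hypothetical exponential growth: the count doubles at every step
t = np.arange(10)
y = 100 * 2.0 ** t

# On the raw scale the increments keep growing
raw_steps = np.diff(y)
print(raw_steps)

# On the log scale each step adds the same amount, log(2),
# so the points fall on a straight line
log_steps = np.diff(np.log(y))
print(log_steps)
```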

In [None]:
sns.distplot(data['IC1'])
#sns.distplot(np.log(data['IC1']))

In [None]:
data['IC1'].describe()

In [None]:
fig,axes=plt.subplots(1,2)
sns.distplot(data['IC1'], ax=axes[0], axlabel='IC1')
sns.distplot(np.log(data['IC1']+1), ax=axes[1], axlabel='log(IC1+1)')

In [None]:
fig,axes=plt.subplots(1,2)
sns.distplot(data['IC2'], ax=axes[0], axlabel='IC2')
sns.distplot(np.log(data['IC2']+1), ax=axes[1], axlabel='log(IC2+1)')

In [None]:
fig,axes=plt.subplots(1,2)
sns.distplot(data['IC3'], ax=axes[0], axlabel = 'IC3')
sns.distplot(np.log(data['IC3']+1), ax=axes[1], axlabel = 'log(IC3+1)')

In [None]:
fig,axes=plt.subplots(1,2)
sns.distplot(data['IC4'], ax=axes[0], axlabel = 'IC4')
sns.distplot(np.log(data['IC4']+1), ax=axes[1], axlabel = 'log(IC4+1)')

Even after using the transformation, there is still some skewness in the column TIMELAG. We will remove the outliers only from the right side of the distribution plot.

In [None]:
sns.distplot(data['TIMELAG'])

Let's start by finding out how many values would be removed if we decided to drop all the values beyond the upper whisker.

In [None]:
iqr = np.percentile(data['TIMELAG'],75) - np.percentile(data['TIMELAG'],25)
upper_limit = np.percentile(data['TIMELAG'],75) + 1.5*iqr
print("The upper whisker is at: %4.2f" % upper_limit)
outliers = data[data['TIMELAG'] > upper_limit].index.tolist()
print("The number of points outside the upper whisker is: ",len(outliers))
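To see the rule at work on numbers that are easy to check by hand, here is a small sketch with a made-up sample (hypothetical values, not from the dataset). The upper limit is the third quartile plus 1.5 times the interquartile range.

```python
import numpy as np

# Toy sample (hypothetical values) with one obviously large point
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr   # limit used for the upper whisker

outliers = values[values > upper_limit]
print(q1, q3, upper_limit, outliers)
```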

## Filtering outliers

Let's explore two different ways to drop outliers.

### Filter function. filter(lambda_function, column)

In [None]:
points = list(filter(lambda x: x < upper_limit, data['TIMELAG']))
len(points)

### Pandas approach

In [None]:
data = data[data['TIMELAG'] < upper_limit]
sns.distplot(data['TIMELAG'])
plt.show()

# Activity:

Let's learn how the following functions work:

*   Map
*   Filter
*   Reduce

## Map

This function applies another given function to every element of an iterable.
It works **elementwise**.

In [None]:
list(map(str,range(15)))

## Filter

This function also works elementwise, but it returns only the elements that meet a condition.

In [None]:
list(filter(lambda x: x %2 == 0,range(15)))

## Reduce

This function performs a computation over a list and returns a single result, accumulated across all of its elements. It is **NOT ELEMENTWISE**.

In [None]:
from functools import reduce

lst = list(range(6))
print("The list is: ",lst)
print("The result of applying reduce over the list is: ",reduce(lambda a,b: a+b,lst))
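The three functions are often chained. As a small illustrative sketch (the numbers are arbitrary), the following keeps the even numbers from 1 to 10, squares them, and adds up the result.

```python
from functools import reduce

numbers = range(1, 11)

evens = filter(lambda x: x % 2 == 0, numbers)   # keeps 2, 4, 6, 8, 10
squares = map(lambda x: x ** 2, evens)          # 4, 16, 36, 64, 100
total = reduce(lambda a, b: a + b, squares)     # folds everything into one number

print(total)  # 220
```

`filter` and `map` return lazy iterators in Python 3, which is why the earlier examples wrap them in `list(...)` to display the values.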

# Lesson 1 Key Concepts

## Selecting categorical data

In [None]:
categoricals = data.select_dtypes(include='object')  # np.object was removed from recent NumPy versions
categoricals.head()

Let's check the number of missing values for 'PVASTATE' column

In [None]:
data['PVASTATE'].value_counts()

Now for column 'RECP3'

In [None]:
data['RECP3'].value_counts()

And finally for 'VETERANS' column

In [None]:
data['VETERANS'].value_counts()

Those columns have too many missing values. If we drop the rows containing those NAs, we risk shrinking our dataset too much. We can't do much with columns that have so many missing values, so let's drop them.

In [None]:
data = data.drop(columns=['PVASTATE', 'RECP3', 'VETERANS'])

# Activity:

For the column 'DOMAIN', discuss which option is better to clean the rows where the values are empty.

*  Option 1: Filtering the rows with the empty values.
*  Option 2: Replacing the empty values with some other category, the most frequently represented value in that column.

In [None]:
data['DOMAIN'].value_counts()

In [None]:
unique_values = list(np.unique(data['DOMAIN']))
print(unique_values)

This column has many possible values. It is difficult to decide how to impute these values, and the number of missing values is quite small compared with the number of non-missing values. Therefore, dropping the rows with missing values will not hurt.
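For comparison, both options can be sketched on a tiny made-up column (the values are hypothetical, and a single blank `' '` is assumed to mark an empty entry).

```python
import pandas as pd

# Toy column (hypothetical values); ' ' marks an empty entry
toy = pd.DataFrame({'DOMAIN': ['R2', 'S1', ' ', 'R2', 'T2', ' ', 'R2']})
is_empty = toy['DOMAIN'].str.strip() == ''

# Option 1: filter out the rows with empty values
option1 = toy[~is_empty].reset_index(drop=True)

# Option 2: replace empty values with the most frequent non-empty category
most_frequent = toy.loc[~is_empty, 'DOMAIN'].mode()[0]
option2 = toy.copy()
option2.loc[is_empty, 'DOMAIN'] = most_frequent

print(len(option1), most_frequent)
print(option2['DOMAIN'].value_counts())
```

Option 2 keeps every row but inflates the most frequent category, which is one reason to prefer filtering when only a few rows are affected.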

In [None]:
# np.unique returns sorted values, so the blank entry ' ' comes first; [1:] skips it
data = data[data['DOMAIN'].isin(unique_values[1:])]
# Note after you filter, it is a good practice to reset the index
data = data.reset_index(drop=True)
data.head()

Let's check what we have now.

In [None]:
data["DOMAIN"].value_counts()

In [None]:
#filter(lambda x: x != " ",data['DOMAIN'])

# Lesson 2 Key Concepts

Let's consider the column 'GENDER'

In [None]:
data['GENDER'].value_counts()

No missing values as we cleaned it this morning ;)

Now let's see if there are differences in 'AVGGIFT' by gender.

In [None]:
# Visually analyzing categorical data with Target variable
sns.boxplot(x="GENDER", y="AVGGIFT", data=data)
plt.show()

In [None]:
ax1 = sns.distplot(data['AVGGIFT'][data['GENDER'] == 'M'], color = 'Red')
ax2 = sns.distplot(data['AVGGIFT'][data['GENDER'] == 'F'], color = 'Blue')
plt.xlim(0, 200)

The two groups don't look too different. They have a few outliers.

**HOWEVER**, be careful. You don't yet know how the distributions look inside the boxes!

Let's check the average gift by gender.

In [None]:
sns.barplot(x="GENDER", y="AVGGIFT", data=data)
plt.show()

We can conclude that the average gift does not differ significantly by gender. Therefore, let's remove this column.

In [None]:
data = data.drop(columns=['GENDER'])

# Activity:

There is a more efficient way to use map over pandas dataframes, and it is called [apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

In [None]:
#data['GENDER'] = data['GENDER'].apply(lambda x: 'other' if x in ['',' ' ,'U', 'C', 'J', 'A'] else x)
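A small sketch of the same idea on a made-up series (the values are hypothetical): `apply` runs a function on every element of a pandas Series, and for a simple condition like this one the built-in `where` method gives the same result without a Python-level function call per element.

```python
import pandas as pd

# Toy gender codes (hypothetical values)
gender = pd.Series(['F', 'M', ' ', 'U', 'F', 'J'])

# Element-wise with apply: anything other than 'F' or 'M' becomes 'other'
with_apply = gender.apply(lambda x: x if x in ('F', 'M') else 'other')

# Equivalent with where: keep values where the condition holds, else 'other'
with_where = gender.where(gender.isin(['F', 'M']), 'other')

print(with_apply.tolist())
```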

# Lesson 3 Key Concepts.

## Dealing with a large number of categories

Let's inspect the column 'STATE'

In [None]:
state_values = list(np.unique(data['STATE']))
state_values

Hmmm, there are typos... What is 'AA'? A current list of abbreviations can be found [here](https://www.ssa.gov/international/coc-docs/states.html)

In [None]:
real_states = ['AL','AK','AS','AZ','AR','CA','CO','CT','DE','DC','FL','GA','GU','HI','ID','IL','IN','IA','KS',
'KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','MP','OH','OK','OR',
'PA','PR','RI','SC','SD','TN','TX','UT','VT','VA','VI','WA','WV','WI','WY']

First we are going to filter out values which don't correspond to any entry in the previous list.

In [None]:
data = data[data['STATE'].isin(real_states)]

Now, let's check the frequencies of each state.

In [None]:
vals = pd.DataFrame(data['STATE'].value_counts())
vals = vals.reset_index()
vals.columns = ['state', 'counts']
vals

As we can see, there are states which are underrepresented. We have several options.

*  Group the states into smaller groups.
*  Group underrepresented states into a single group.
*  A combination of both.

We will use the last option.

Given the previous state frequencies, can you guess any business insight?

First, let's get the states which are underrepresented.

In [None]:
group_states_df = vals[vals['counts']<2500]
group_states = list(group_states_df['state'])
group_states

In [None]:
def clean_state(x):
    if x in group_states:
        return 'other'
    else:
        return x

data['STATE'] = list(map(clean_state, data['STATE']))

What are now our final groups?

In [None]:
new_state_values = list(np.unique(data['STATE']))
new_state_values

## Binning numerical columns.

Let's now look at the 'IC2' column. This column is numerical, but we would like to make it categorical using **binning**.

In [None]:
ic2_labels = ['Low', 'Moderate', 'High', 'Very High']
data['IC2_NEW'] = pd.cut(data['IC2'],4, labels=ic2_labels) # see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html?highlight=cut#pandas.cut
data['IC2_NEW'].value_counts()
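Passing an integer to `pd.cut` creates bins of equal width, so skewed data can end up almost entirely in one bin. The sketch below uses made-up values to show this, next to `pd.qcut`, an alternative that is not used in this notebook and that builds bins with roughly equal numbers of rows.

```python
import pandas as pd

# Toy skewed values (hypothetical)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])
labels = ['Low', 'Moderate', 'High', 'Very High']

# pd.cut: 4 bins of equal width over the range of the data
equal_width = pd.cut(values, 4, labels=labels)

# pd.qcut: 4 bins holding (roughly) the same number of rows
equal_count = pd.qcut(values, 4, labels=labels)

# Count the rows per label, in label order
width_counts = [int((equal_width == lab).sum()) for lab in labels]
count_counts = [int((equal_count == lab).sum()) for lab in labels]
print(width_counts)
print(count_counts)
```

Which of the two is preferable depends on whether the bin edges or the bin sizes should be evenly spaced.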

# Activity:

Use the column MDMAUD to reduce the number of categories to two (XXXX and other).

# Lesson 4 Key Concepts.

Regular expressions in Python. 

[see here](https://docs.python.org/3/library/re.html)

[practice here](https://pythex.org/)

`*`: Matches the previous character 0 or more times.

`+`: Matches the previous character 1 or more times.

`?`: Matches the previous character 0 or 1 times (optional).

`{}`: Matches the previous character the number of times specified inside the braces:

`{n}`: Exactly n times.

`{n,}`: At least n times.

`{n,m}`: Between n and m times.

In [None]:
import re

Let's see some examples.

In [None]:
text = "The complicit caat interacted with the other cats exactly as we expected."
pattern = "c*t"
print(re.findall(pattern, text))

In [None]:
text = "The complicit caat interacted with the other cats exactly as we expected."

pattern = 'c*a*t'
print(re.findall(pattern, text))

In [None]:
text = "The complicit caaaat ct interacted with the other cats exactly as we expected."
pattern = "a+"
print(re.findall(pattern, text))

In [None]:
text = "Is the correct spelling color or colour?"
pattern = "colou?r"
print(re.findall(pattern, text))

In [None]:
text = "We can match the following: aaaawwww, aww, awww, awwww, awwwww"
pattern = "aw{3}"
print(re.findall(pattern, text))

In [None]:
text = "Let's see how we can match the following: aaw, aaww, aawww, awwww, awwwww"
pattern = "aw{1,}"
print(re.findall(pattern, text))

In [None]:
pattern = "a{2,}w{2,}"
print(re.findall(pattern, text))
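The `{n,m}` form from the list of quantifiers has not been shown yet. A short sketch, reusing the same sentence as the previous examples:

```python
import re

text = "Let's see how we can match the following: aaw, aaww, aawww, awwww, awwwww"

# aw{2,3}: one 'a' followed by between 2 and 3 'w' characters.
# Quantifiers are greedy, so 3 are taken whenever they are available.
pattern = "aw{2,3}"
matches = re.findall(pattern, text)
print(matches)
```

Note that 'aaw' does not match, because its only 'w' is below the minimum of two.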

# Activity:

Create a function to automate the process of reducing the number of values of a categorical column.
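One possible starting point, offered as a sketch rather than a reference solution: the function name `group_rare_categories`, its parameters, and the toy values are all hypothetical. It generalizes the threshold-based grouping used for the 'STATE' column.

```python
import pandas as pd

def group_rare_categories(column, min_count, other_label='other'):
    """Replace categories that appear fewer than `min_count` times with `other_label`."""
    counts = column.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the label
    return column.where(~column.isin(rare), other_label)

# Toy column (hypothetical values)
states = pd.Series(['CA', 'CA', 'CA', 'TX', 'TX', 'NV', 'WY'])
grouped = group_rare_categories(states, min_count=2)
print(grouped.tolist())
```

The threshold is left as a parameter because a sensible cut-off depends on the size of the dataset and on the column being cleaned.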