
# Jupyter notebook for the case study (using Python 3)

Importing the necessary libraries:
* pandas to work efficiently with DataFrames
* NumPy for math / linear algebra
* datetime to work with date/time data
* LogisticRegression (with L2 regularization) - ML model to determine key factors
* train_test_split - to split data into training and test sets
* cross_val_score to perform cross-validation when calibrating the model
* StandardScaler to normalize the data
* matplotlib (plt) for visualisations

In [104]:
import pandas as pd
import numpy as np
import datetime as dt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#from sklearn.model_selection import KFold ---- delete?
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt


## Task 1

**_1) Setup_**

defining dataset names. Can change names to add other datasets.

In [2]:
# small datasets for quick testing; the next cell overrides these names with the full datasets
name_dataset_0 = 'small_app_dataset.csv' # 'app_dataset.csv'
name_dataset_1 = 'small_dataset_1.csv' # 'dataset_1.csv'
name_dataset_2 = 'small_dataset_2.csv' # 'dataset_2.csv'

In [3]:
name_dataset_0 = 'app_dataset.csv'
name_dataset_1 = 'dataset_1.csv'
name_dataset_2 = 'dataset_2.csv'

defining key names

In [4]:
key1 = 'key1'
key2 = 'key2'
key_names = [key1, key2]

saving CSV formatted datasets as Pandas dataframes

In [5]:
dataset_0 = pd.read_csv(name_dataset_0, sep=';')
dataset_1 = pd.read_csv(name_dataset_1, sep=';')
dataset_2 = pd.read_csv(name_dataset_2, sep=';')

**_2) Investigating the datasets - checking how many rows, columns and elements they have_**

function to print the number of columns, rows and elements for each dataset

In [6]:
def print_col_row_and_cell_count(df):
    row_count, column_count = df.shape
    element_count = column_count*row_count
    print('column count:  ', column_count)
    print('row count:     ', row_count)
    print('element count: ', element_count)
    print()

total number of row and column count for each dataset (including NA values)

In [7]:
print('1) dataset 0')
print_col_row_and_cell_count(dataset_0)
print('2) dataset 1')
print_col_row_and_cell_count(dataset_1)
print('3) dataset 2')
print_col_row_and_cell_count(dataset_2)

1) dataset 0
column count:   5
row count:      798
element count:  3990

2) dataset 1
column count:   169
row count:      14571
element count:  2462499

3) dataset 2
column count:   37
row count:      10137
element count:  375069



**_3) Joining the datasets_**

In [8]:
dataset_0_and_1 = pd.merge(dataset_0, dataset_1, how='left', on=key2)

In [9]:
dataset_full_not_cleaned = pd.merge(dataset_0_and_1, dataset_2, how='left', on=key1)
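A left join keeps every row of the left table, but it can also add rows if a key occurs more than once in the right table. The sketch below uses made-up toy frames (`left`, `right` and their values are assumptions, not the case-study data) to show how the `validate` and `indicator` parameters of `pd.merge` can check both properties.

```python
import pandas as pd

# Hypothetical toy tables; the real datasets are joined on key1 / key2
left = pd.DataFrame({'key1': [1, 2, 3], 'response': [0, 1, 0]})
right = pd.DataFrame({'key1': [1, 2, 4], 'v1': [10.0, 20.0, 40.0]})

# validate='many_to_one' raises an error if key1 is duplicated in the right table;
# indicator=True adds a '_merge' column telling where each row came from
joined = pd.merge(left, right, how='left', on='key1',
                  validate='many_to_one', indicator=True)

# rows of the left table that found no match in the right table
n_unmatched = int((joined['_merge'] == 'left_only').sum())
print(len(joined), n_unmatched)
```

If the row count after the join equals the row count of the left table, as it does for the 798 rows here, the join did not duplicate any master rows.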

In [10]:
dataset_full_not_cleaned.to_csv('output_dataset_full_not_cleaned.csv')

In [11]:
print('dataset_full - before cleaning NAs')
print_col_row_and_cell_count(dataset_full_not_cleaned)

dataset_full - before cleaning NAs
column count:   209
row count:      798
element count:  166782



**_4) Dropping columns with keys. Removing columns and rows containing many NA values. Saving the final dataset to CSV file_**

After the join is done, keys are not needed. Dropping them.

In [12]:
dataset_full_not_cleaned_keys_dropped = dataset_full_not_cleaned.drop(key_names, axis=1)

Function to deal with NA values. It will drop rows and columns if the amount of non-NA values in a given column or row is below a given threshold. By default it is 20% for columns and 5% for rows.

In [146]:
def drop_rows_and_cols_with_NA_below_thresholds(input_df, key_names=key_names, col_thresh=0.20, row_thresh=0.05):
    # key_names is currently unused; it is kept so that existing calls keep working
    df = input_df.copy(deep=True)
    
    number_of_cols = len(df.columns)
    row_threshold_integer = round(row_thresh * number_of_cols)
    df = df.dropna(axis=0, thresh=row_threshold_integer) # dropping rows whose non-NA cell count is below the threshold
    
    number_of_rows = len(df)
    col_threshold_integer = round(col_thresh * number_of_rows)
    output_df = df.dropna(axis=1, thresh=col_threshold_integer) # dropping columns whose non-NA cell count is below the threshold
    return output_df
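To make the threshold logic concrete, here is a small worked example on a made-up frame (`toy` and the 50% / 75% thresholds are assumptions chosen for illustration). It repeats the two `dropna(thresh=...)` steps: rows first, then columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4 rows and 3 columns
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                    'b': [np.nan, np.nan, 6.0, np.nan],
                    'c': [7.0, 8.0, 9.0, np.nan]})

# Step 1: keep rows with at least round(0.5 * 3) = 2 non-NA cells
row_thresh_int = round(0.5 * toy.shape[1])
rows_kept = toy.dropna(axis=0, thresh=row_thresh_int)

# Step 2: keep columns with at least round(0.75 * 2) = 2 non-NA cells
col_thresh_int = round(0.75 * len(rows_kept))
cols_kept = rows_kept.dropna(axis=1, thresh=col_thresh_int)

print(list(rows_kept.index), list(cols_kept.columns))
```

Note that Python's `round` rounds exact halves to the nearest even integer (for example `round(2.5)` is 2), so the integer threshold can be one lower than expected for some inputs.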

In [14]:
dataset_full_clean = drop_rows_and_cols_with_NA_below_thresholds(dataset_full_not_cleaned_keys_dropped, 
                                                                 col_thresh=0.20, row_thresh=0.05)

In [15]:
print('dataset_full_clean - after some columns and rows with many missing values are removed')
print_col_row_and_cell_count(dataset_full_clean)

dataset_full_clean - after some columns and rows with many missing values are removed
column count:   60
row count:      772
element count:  46320



Saving the final dataset as a CSV file

In [16]:
dataset_full_clean.to_csv('output_dataset_full_clean.csv')

**_5) Observations on data integrity_**

Overall, we see that a lot of data is not used. The joined table has 798 rows (the same as the 'master' dataset_0, because that dataset is the left side of the left outer join). Dataset1 has 14571 rows and dataset2 has 10137. Since the response variable is available only for these 798 rows, we have to ignore most of the rows from dataset1 and dataset2.

On top of that, there are a lot of missing values (NA), especially in dataset1. The combined dataset has 209 columns before the columns with many NAs are removed. After I drop the two key columns and remove the sparse columns using the 20% threshold, only 60 columns remain. [UPDATE - provide counts on NA in each table. Maybe update print function to show NA cells as well]
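One possible way to report NA counts per table is sketched below. The helper name `print_na_counts` and the toy frame are assumptions made for illustration; the helper could be called on dataset_0, dataset_1 and dataset_2 in the same way as the existing print function.

```python
import numpy as np
import pandas as pd

def print_na_counts(df):
    """Print and return the number and share of NA cells in a dataframe."""
    na_count = int(df.isna().sum().sum())        # NA cells over all columns
    element_count = df.shape[0] * df.shape[1]    # total number of cells
    na_share = na_count / element_count if element_count else 0.0
    print('NA count:      ', na_count)
    print('NA share:      ', round(na_share, 3))
    return na_count, na_share

# Hypothetical toy frame: 3 of its 4 cells are missing
toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
na_count, na_share = print_na_counts(toy)
```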

**_3) .....handling NA in some other way???...... _**

## Task 2

**_1) Setup_**

It is important to clean the dataset and to make various transformations before performing any analysis on it.

(a) Defining the name of the target variable

In [17]:
target = 'response'

(b) Defining a function to get all column names except for the target column (the key columns were already dropped after the join). This allows dataframes to be analyzed dynamically, without knowing exactly which columns they have.

In [66]:
def get_col_names_without_target(dataframe, target = target):
    column_names_list = list(dataframe.columns)
    if target in column_names_list:
        column_names_list.remove(target)
    return column_names_list

(c) We want to determine which factors are the most important in predicting the target variable (response). Many variables still have too many NAs, so I will use a more aggressive column threshold (60%) to remove columns/factors with many missing values. Otherwise, we would introduce too much bias by trying to impute them all.

In [19]:
dataset_full = drop_rows_and_cols_with_NA_below_thresholds(dataset_full_clean, col_thresh=0.60, row_thresh=0.05)
print_col_row_and_cell_count(dataset_full)

column count:   38
row count:      772
element count:  29336



(d) Python uses '.' as a decimal point. However, datasets sometimes use ',' as a decimal point, so ',' needs to be replaced with '.'. After this is done, floats stored as strings need to be converted to Python floats.

In [20]:
def replace_commas_with_dots_in_string(single_string):
    if type(single_string) == str:
        single_string = single_string.replace(',','.')
    return single_string

In [21]:
# applying a function on each cell of a dataframe
dataset_full = dataset_full.applymap(replace_commas_with_dots_in_string)

In [22]:
#function to convert floats stored as string to floats
def convert_floats_in_string_to_floats(element):
    if type(element) == str:
        try:
            return float(element)
        except (ValueError, TypeError):
            return element
    return element

In [23]:
dataset_full = dataset_full.applymap(convert_floats_in_string_to_floats)
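The same idea can also be expressed column by column. The following is only a sketch of an alternative, not the approach used above: the helper name `convert_decimal_comma_column` and the toy frame are assumptions. A column is converted only if every non-missing value parses as a number, so free text that happens to contain commas is left untouched.

```python
import pandas as pd

def convert_decimal_comma_column(col):
    """Return col as floats if all non-missing values parse, else return it unchanged."""
    if pd.api.types.is_numeric_dtype(col):
        return col  # already numeric, nothing to do
    as_text = col.astype(str).str.replace(',', '.', regex=False)
    candidate = pd.to_numeric(as_text, errors='coerce')  # unparseable values become NaN
    # accept the conversion only if no originally present value was lost
    if candidate.notna().equals(col.notna()):
        return candidate
    return col

# Hypothetical toy frame: 'x' holds decimal commas, 'y' holds free text, 'z' is numeric
toy = pd.DataFrame({'x': ['2,5', '4,4', None],
                    'y': ['text with 4,5', '1,0', None],
                    'z': [1, 2, 3]})
converted = toy.apply(convert_decimal_comma_column)
print(converted.dtypes)
```

Unlike the cell-by-cell version, this sketch does not change commas inside text columns, which may or may not be desired for the later dummy encoding.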

(e) Some columns might contain dates in string format. I will convert those to floats. It is done by first converting string dates to datetime format. Then I subtract the epoch date (1 Jan 1970) from those datetimes and convert the difference to seconds, which is in float format. Essentially, after the transformation each cell with a date shows how many seconds have passed between 1 Jan 1970 and this cell's initial date. This number is a float, so regression ML algorithms (linear regression, random forest regressor, etc.) can be applied to it.

The function below does this transformation. It is a vectorized function, so it is efficient. Also, it converts only those columns that initially contain dates stored as strings in the '%Y-%m-%d %H:%M' format; all other columns are left unchanged. Thus, it is very general and would work on various datasets.

In [24]:
def string_dates_to_sec_after_epoch_as_float(input_col):
    if input_col.dtype=='O': # in a pandas DataFrame, columns containing strings have dtype object ('O')
        col_datetime=pd.to_datetime(input_col, format='%Y-%m-%d %H:%M', errors='ignore') # convert to datetime only if the format is '%Y-%m-%d %H:%M'
        if col_datetime.dtype=='datetime64[ns]': 
            epoch_timestamp_col = col_datetime - dt.datetime(1970, 1, 1)
            sec_float_col = epoch_timestamp_col / np.timedelta64(1, 's')
            return sec_float_col
        return col_datetime
    else:
        return input_col           

In [25]:
dataset_full_time_converted = dataset_full.apply(string_dates_to_sec_after_epoch_as_float)
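A small worked example of the epoch conversion is given below, on a made-up series (`dates` is an assumption). It swaps in `errors='coerce'`, which turns unparseable or missing values into NaT, instead of `errors='ignore'`, which recent pandas releases deprecate; this is a sketch of the arithmetic, not a replacement for the function above.

```python
import pandas as pd

# Hypothetical date strings in the '%Y-%m-%d %H:%M' format, with one missing value
dates = pd.Series(['1970-01-01 00:01', '1970-01-02 00:00', None])

# Strings -> datetimes; anything that does not match the format becomes NaT
parsed = pd.to_datetime(dates, format='%Y-%m-%d %H:%M', errors='coerce')

# Datetimes -> seconds since the epoch (1 Jan 1970) as floats
seconds = (parsed - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)
print(seconds.tolist())
```

One minute after the epoch maps to 60 seconds and one day after it to 86400 seconds; the missing date stays missing and is handled by the imputation step later.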

(f) We have converted string columns that contain dates and floats. Now, the remaining columns with strings (text) contain only categorical variables (e.g. 'big', 'small' and 'medium'). We need to convert this information to numerical data. I will do it by creating a binary variable for each category. A binary variable (dummy) being 1 means that a given record belongs to a given category, and 0 indicates that it does not belong. If the value is missing, then a new category ('missing') is created. The initial column with strings is dropped. For example, if column B contains 'yes', 'no' and missing values, then column B is dropped and 3 new columns are created: B_yes, B_no and B_nan.

In [26]:
# !!!!!!!! temp solution !!!!!!!!!!!! manually dropping column v179. Need to find a generic solution
dataset_full_time_converted = dataset_full_time_converted.drop(['v179'], axis=1)

In [27]:
dataset_full_with_dummies = pd.get_dummies(dataset_full_time_converted, dummy_na=True)
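The dummy encoding described in (f) can be illustrated on a made-up frame (`toy` is an assumption). Numeric columns pass through unchanged, while the text column is replaced by one indicator column per category plus one for missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one categorical text column and one numeric column
toy = pd.DataFrame({'B': ['yes', 'no', np.nan], 'n': [1.0, 2.0, 3.0]})

# dummy_na=True adds an extra indicator column for missing values
dummies = pd.get_dummies(toy, dummy_na=True)
print(list(dummies.columns))
```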

**_2) Imputing remaining missing values._**

There are still many missing values. In order to use Machine Learning models in Task 2 and 3, I need to remove or impute missing values (NAs). In the previous parts I have removed some. The remaining will be imputed.

(a) Imputing missing values. I am taking the median value of each feature, as it is more robust than the mean (outliers have a significant impact on the mean, but not on the median).

In [28]:
dataset_filled = dataset_full_with_dummies.fillna(dataset_full_with_dummies.median())
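The effect of an outlier on the imputed value can be seen in a tiny made-up example (`toy` and the column name `income` are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical feature with one extreme outlier and one missing value
toy = pd.DataFrame({'income': [10.0, 12.0, 11.0, 1000.0, np.nan]})

# fillna with a Series fills each column with its own statistic
filled_median = toy.fillna(toy.median())
filled_mean = toy.fillna(toy.mean())

print(filled_median['income'].iloc[4], filled_mean['income'].iloc[4])
```

The median fill stays close to the typical values, while the mean fill is pulled far towards the outlier.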

**_3) Determining the strongest predictors_**

To determine the strongest predictors, logistic regression is a good ML algorithm. I choose regression because it explicitly shows which factors have more impact on the target and which have less. I will normalize the data so that the coefficients are comparable. I will use L2 regularization, which helps to deal with collinearity and overfitting.

(a) Normalizing the data

In [152]:
y = pd.DataFrame(dataset_filled[target].values)
predictors = get_col_names_without_target(dataset_filled)
scaler = StandardScaler()
scaler.fit(dataset_filled[predictors]) #normalizing only predictors
dataset_normalized = pd.DataFrame(scaler.transform(dataset_filled[predictors].values), columns=predictors)
dataset_normalized[target] = y #adding target back

(b) I will split the data into train and test sets. The training set will be used to train and calibrate the model. The test set is used to assess the final model. The test set is 20% of the data and the train set is 80%. Random state is set to 1, so that the data is split in the same manner every time I run the split function.

In [153]:
train, test = train_test_split(dataset_normalized, test_size=0.2, random_state=1)

(c) Creating a model. C (the inverse of the L2 regularization strength) is set to 1, but will be calibrated later. I will fit an intercept to get less biased coefficients.

In [154]:
log_reg = LogisticRegression(penalty='l2', C=1, fit_intercept=True, random_state=1)
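Calibrating C could look roughly like the sketch below. It is an assumption about how the calibration might be done, not the notebook's own procedure: the data is synthetic (`make_classification`), the grid of C values is arbitrary, and the scaler is placed inside a pipeline so that it is refitted on each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the case-study data
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Scaling + L2-regularized logistic regression (L2 is the default penalty)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

best_C = search.best_params_['logisticregression__C']
test_accuracy = search.score(X_test, y_test)
print(best_C, round(test_accuracy, 3))
```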

(d) Evaluating the model with 5-fold cross-validation on the training set, then fitting it on the full training set and scoring it on the train and test sets.

In [155]:
#get columns names except the target
predictors = get_col_names_without_target(dataset_normalized, target='response')

In [156]:
scores = cross_val_score(log_reg, train[predictors], train[target], cv=5)

In [157]:
scores

array([ 0.79032258,  0.83064516,  0.76612903,  0.80487805,  0.83606557])
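The five fold scores can be summarised and compared with a naive baseline. The sketch below copies the scores printed above; the baseline uses the class counts quoted in Task 4 (645 zeros out of 798 rows), which is an assumption here because the cleaned data has 772 rows and its class split may differ slightly.

```python
import numpy as np

# Cross-validation accuracies copied from the output above
scores = np.array([0.79032258, 0.83064516, 0.76612903, 0.80487805, 0.83606557])
mean_score = scores.mean()
std_score = scores.std()

# Accuracy of always predicting the majority class (response = 0),
# based on the counts from Task 4: 645 of 798 observations
majority_baseline = 645 / 798

print(round(mean_score, 4), round(std_score, 4), round(majority_baseline, 4))
```

Under these assumptions the mean cross-validated accuracy is close to the majority-class baseline, so accuracy alone says little about how well the model separates the two classes.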

In [161]:
log_reg.fit(train[predictors], train[target])

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [162]:
log_reg.score(train[predictors], train[target])

0.82982171799027549

In [163]:
log_reg.score(test[predictors], test[target])

0.84516129032258069

In [164]:
log_reg.get_params()

{'C': 1,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'max_iter': 100,
 'multi_class': 'ovr',
 'n_jobs': 1,
 'penalty': 'l2',
 'random_state': 1,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [165]:
log_reg.coef_

array([[-0.05058131, -0.12786026,  0.48520098,  0.22643517, -0.37975305,
        -0.84479694,  0.29524418, -0.20732367, -0.14189013,  0.07926091,
         0.05775652, -0.0919219 ,  0.34680962, -0.22896688,  0.04180526,
         0.41227788, -0.01980765, -0.03827285,  0.19685987,  0.19685987,
         0.08734483,  0.08734483,  0.        ,  0.06908407, -0.10099104,
        -0.03310244,  0.03611464, -0.01761888,  0.0182894 , -0.03310244,
        -0.02088983,  0.06613705, -0.00544642,  0.008228  ,  0.0182894 ,
         0.10058898, -0.1024774 , -0.00544642, -0.03310244,  0.06613705,
         0.008228  ,  0.0182894 , -0.03310244,  0.00236505,  0.11470161,
        -0.0285687 ,  0.0182894 , -0.0182894 ,  0.0182894 ,  0.08734483,
        -0.03977454,  0.0182894 , -0.0182894 ,  0.0182894 , -0.01659483,
        -0.06908407,  0.02150538,  0.0182894 , -0.0182894 ,  0.0182894 ,
        -0.0182894 ,  0.0182894 ,  0.45887362,  0.22733104, -0.11850065,
         0.06613705,  0.02562817,  0.29388623, -0.4

In [168]:
coef_df = pd.DataFrame(log_reg.coef_, columns=predictors)
coef_df

Unnamed: 0,v001,v002,v4,v5,v14,v29,v120,v123,v173,v174,...,v204_mobile,v204_residential,v204_wifi,v204_wired,v204_nan,v172.1_N,v172.1_P,v172.1_U,v172.1_Y,v172.1_nan
0,-0.050581,-0.12786,0.485201,0.226435,-0.379753,-0.844797,0.295244,-0.207324,-0.14189,0.079261,...,0.025628,0.293886,-0.419102,0.145679,-0.208834,-0.21672,-0.283924,0.104011,-0.072263,0.018289


In [169]:
max(coef_df)

'v5'
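Note that `max(coef_df)` iterates over the column labels, so it returns the alphabetically largest column name ('v5'), not the predictor with the largest coefficient. A minimal sketch of ranking predictors by absolute coefficient is shown below; it uses only the first six coefficients from the table above, so the ranking over all 76 predictors is not implied.

```python
import pandas as pd

# First six coefficients copied from the coef_df output above
coef_df = pd.DataFrame([[-0.050581, -0.12786, 0.485201, 0.226435, -0.379753, -0.844797]],
                       columns=['v001', 'v002', 'v4', 'v5', 'v14', 'v29'])

# Rank predictors by the absolute value of their coefficient
ranked = coef_df.iloc[0].abs().sort_values(ascending=False)
strongest = ranked.index[0]

print(ranked)
print(max(coef_df))  # compares column names as strings, not coefficient values
```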

In [166]:
log_reg.intercept_

array([-1.83642011])

In [112]:
train.shape

(617, 77)

In [113]:
test.shape

(155, 77)

In [70]:
len(predictors)

76

In [71]:
len(list(dataset_normalized))

77

In [101]:
np.abs(-1)

1

In [102]:
dataset_normalized.corr()['response'].apply(np.abs).sort_values(ascending=False)

response                                              1.000000
v191                                                  0.194025
v192                                                  0.194025
v29                                                   0.172579
v204_wifi                                             0.148913
v204_business                                         0.105443
v204_cellular                                         0.097367
v204_residential                                      0.077241
v002                                                  0.072353
v182                                                  0.066963
v120                                                  0.051613
v177                                                  0.048428
v172.1_N                                              0.045414
v174                                                  0.044762
v173                                                  0.042317
v195_Low                                              0

In [93]:
plt.matshow(dataset_filled.corr())

<matplotlib.image.AxesImage at 0x1ba141045f8>

In [None]:
take only text columns 

In [None]:
replace ',' with '.' in floats => use regex (*[N*int','M*int])

impute missing values!!!

normalize all data

one hot encoder / create dummies for text information

train regularized regression 

dimensionality reduction

## Task 4

1) Deal with the imbalanced dataset. Out of 798 observations, the response variable is 0 in 645 observations and 1 in 153. It is not a very big imbalance, but prediction accuracy might improve if I dealt with it. (a) The simplest approach is to randomly remove 492 rows where the response variable is 0; this would result in a balanced dataset with 153 cases of the response being 0 and 153 cases of it being 1. (b) A somewhat better approach would be to put more weight on observations where the response is 1. Each such observation would weigh about 4.2 (645/153). (c) Employ one of the many other approaches to dealing with imbalanced datasets.

2) Columns v173, v175 and v177 contain some date information. It would be good to understand what these dates are about and then to extract some valuable features. It could be: duration, starting and end time in hours, days, months, etc. Such information could be helpful for making better predictions.


3) I am mainly removing columns with many NAs. For rows I was more conservative - I was removing only those that had all NA values except for key columns. It might be beneficial to apply a threshold and remove rows that have too many missing values (similarly to what I did with columns).

4) Use better techniques for dimensionality reduction

5) Use SVM for sparse datasets

6) Drop columns that have too little variation.
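Two of the ideas above, the class weights from 1(b) and the low-variation filter from 6), can be sketched as follows. The label vector is rebuilt from the counts quoted in 1), while the toy frame and the "at most one distinct value" rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# 1(b): labels rebuilt from the counts above (645 zeros, 153 ones)
y = np.array([0] * 645 + [1] * 153)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
weight_ratio = weights[1] / weights[0]  # relative weight of a positive observation

# 6): hypothetical toy frame; drop columns with at most one distinct value
toy = pd.DataFrame({'constant': [1, 1, 1, 1],
                    'almost': [0, 0, 0, 1],
                    'varied': [1, 2, 3, 4]})
n_unique = toy.nunique()
low_variation_cols = n_unique[n_unique <= 1].index.tolist()
reduced = toy.drop(columns=low_variation_cols)

print(round(weight_ratio, 2), low_variation_cols)
```

Passing `class_weight='balanced'` to `LogisticRegression` applies the same weighting inside the model, so the weights do not have to be computed by hand.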

In [None]:
##############################################################################################################################

In [None]:
len(dataset_full_with_dummies.median())

In [None]:
def get_col_names_without_target(dataframe, target = target):
    all_column_names_list = list(dataframe.columns)
    all_column_names_list.remove(target)  # list.remove works in place and returns None
    return all_column_names_list

In [65]:
mylist = ['a', 'b', 'c']
mylist

['a', 'b', 'c']

In [46]:
mylist.remove('a')
mylist

['b', 'c']

In [63]:
'a' in mylist

True

In [None]:
dataset_orange.domain

In [None]:
dataset_orange.save("output_dataset_orange.csv")

In [None]:
dataset_normalized

In [167]:
dataset_normalized.to_csv("output_dataset_normalized.csv")

In [133]:
dataset_full_not_cleaned_keys_dropped.to_csv('output_dataset_full_not_cleaned_keys_dropped.csv')

In [None]:
set(dataset_full_with_dummies.dtypes)

In [None]:
dataset_full_with_dummies_np[1]

In [None]:
dataset_full_np2

In [None]:
len(dataset_full_np2)

In [None]:
dataset_full_np2.shape

In [None]:
dataset_full_np2[0]

In [None]:
dataset_full_np.shape

In [None]:
len(dataset_full_np)

In [None]:
type(dataset_full_np)

In [None]:
dataset_full_np

In [None]:
dataset_full

In [None]:
dataset_0_and_1

In [None]:
dataset_full.loc[:5]

In [None]:
len(dataset_full)

In [None]:
dataset_full_with_dummies.loc[:5]

In [None]:
set(dataset_full_with_dummies.dtypes)

In [132]:
dataset_full_with_dummies.to_csv('output_dataset_full_with_dummies.csv')

In [None]:
dataset_full_time_converted.loc[5:]

In [None]:
dataset_full_time_converted.dtypes

In [None]:
dataset_full_time_converted.to_csv('output_dataset_full_time_converted.csv')

In [131]:
dataset_filled.to_csv('output_dataset_filled.csv')

In [None]:
dataset_0.shape[1]

In [None]:
dataset_example = dataset_1.copy(deep=True)
dataset_example.shape

In [None]:
#dataset_example = dataset_example.dropna(axis=0, how='all', subset=all_columns_no_key)

In [None]:
#dataset_example

In [None]:
list(dataset_0)

In [None]:
dataset_0

In [None]:
#set(dataset_1)

In [None]:
dataset_1

In [None]:
#list(dataset_2)

In [None]:
dataset_2

In [115]:
 df1 = pd.DataFrame({'A': ['yes', 'yes', 'no', 'maybe'],
                        'B': ['cat1', 'cat1', 'cat2', np.nan],
                        'C': [1.6, 5.3, 0.0, 7.3],
                        'D': [6, 3, 2, 2]},  index=[0, 1, 2, 3])
df1    

Unnamed: 0,A,B,C,D
0,yes,cat1,1.6,6
1,yes,cat1,5.3,3
2,no,cat2,0.0,2
3,maybe,,7.3,2


In [None]:
pd.get_dummies(df1, dummy_na=True)

In [122]:
df2 = pd.DataFrame({'A': [1, 3, 4, 5],
                        'B': [3.5, 6.6, 7.89, np.nan],
                        'C': [1.6, 5.3, 0.0, 7.3],
                        'D': [6, 3, np.nan, 2]},  index=[0, 1, 2, 3])
df2    

Unnamed: 0,A,B,C,D
0,1,3.5,1.6,6.0
1,3,6.6,5.3,3.0
2,4,7.89,0.0,
3,5,,7.3,2.0


In [None]:
df2_np = df2.values
df2_np

In [121]:
df3 = pd.DataFrame({'E': [4, 3, 3, 5]})
df3

Unnamed: 0,E
0,4
1,3
2,3
3,5


In [123]:
df2['e'] = df3
df2

Unnamed: 0,A,B,C,D,e
0,1,3.5,1.6,6.0,4
1,3,6.6,5.3,3.0,3
2,4,7.89,0.0,,3
3,5,,7.3,2.0,5


In [None]:
domain = Orange.data.Domain([size, height, shape], speed)

In [None]:
#df2_orange = Orange.data.Table(my_domain, df2_np)
#df2_orange

In [None]:
df2_orange.domain

In [None]:
type(df2_orange.domain)

In [None]:
set_np = np.array([[1, 2, 3], [5, 9.8, 14.7],
                    [2, 4, np.nan], [1, 2, 3.5], 
                    [1, 2, 3], [3, 6.1, 8.9],
                    [2, 4, 6], [3, 5.9, np.nan],
                    [1, 2, 3], [1, 1.8, 3.3]],)
set_np

In [None]:
set_pd = pd.DataFrame(set_np, columns = ['A', 'B', 'C'])
set_pd

In [None]:
set_filled=set_pd.copy(deep=True)
set_filled.fillna(set_pd.median())

In [None]:
set_pd.median()

In [None]:
orange_set = Orange.data.Table(set_np)
orange_set

In [None]:
from Orange.preprocess import Impute
imputer = Orange.preprocess.Impute.ModelConstructor()
imputer.learner_continuous = imputer.learner_discrete = Orange.classification.tree.TreeLearner(min_subset=20)
#imputer.learner_continuous = Orange.ensemble.forest.RandomForestLearner
imputer = imputer(orange_set)

In [None]:
from Orange.preprocess import Impute

In [147]:
na_df = pd.DataFrame([[1, 7, np.nan, np.nan, np.nan], [1, 7, np.nan, np.nan, np.nan],
                    [1, 2, 3, 4, 5], [3, 4, 5, 1, np.nan],
                    [6, 4, 5, np.nan, np.nan], [1, 2, np.nan, np.nan, np.nan], 
                    [1, 7, np.nan, np.nan, np.nan], [1, 7, np.nan, np.nan, np.nan],
                    [1, 7, np.nan, np.nan, np.nan], [1, 7, np.nan, np.nan, np.nan],
                    [1, 7, np.nan, np.nan, np.nan], [1, 7, np.nan, np.nan, np.nan]],
                    columns=['key1','A','B','C','D'])
na_df

Unnamed: 0,key1,A,B,C,D
0,1,7,,,
1,1,7,,,
2,1,2,3.0,4.0,5.0
3,3,4,5.0,1.0,
4,6,4,5.0,,
5,1,2,,,
6,1,7,,,
7,1,7,,,
8,1,7,,,
9,1,7,,,


In [148]:
na_df4 = na_df.copy(deep=True)
na_df4 = drop_rows_and_cols_with_NA_below_thresholds(na_df4, key_names=key_names, col_thresh=0.1, row_thresh=0.6).loc[:100]
na_df4

Unnamed: 0,key1,A,B,C,D
2,1,2,3.0,4.0,5.0
3,3,4,5.0,1.0,
4,6,4,5.0,,


In [None]:
na_df4 = na_df.copy(deep=True)
na_df4 = na_df4.dropna(axis=1, thresh=1) # droping NA columns
na_df4

In [None]:
na_df2=na_df.copy(deep=True)
na_df2 = na_df2.dropna(axis=0, how='all',subset={'B','C','A'})
na_df2

In [None]:
na_df_columns = list(na_df.columns)
na_df_columns

In [None]:
na_df_columns_set=set(na_df_columns)
na_df_columns_set

In [None]:
list(na_df_columns_set)

In [None]:
set(na_df2)

In [None]:
na_df3=na_df.copy(deep=True)
drop_NA_only_columns_and_rows(na_df3)

In [None]:
[1,2,3] - [1,2]

In [None]:
set([1,2,3]) - set([1,2,4])

In [None]:
#old drop NA function v1
def drop_NA_only_columns_and_rows(input_df, key_names=key_names):
    df = input_df.copy(deep=True)
    df_columns = set(df)
    df_columns_without_keys = df_columns - set(key_names)
    df = df.dropna(axis=0, how='all', subset=df_columns_without_keys) # droping rows that have all NA values except for keys
    df = df.dropna(axis=1, how='all') # droping NA columns
    return df

In [None]:
#old drop NA function v2
def drop_rows_with_NA_only_and_cols_with_NA_below_threshold(input_df, key_names=key_names, threshold_percent=0.20):
    df = input_df.copy(deep=True)
    
    df_columns = set(df)
    df_columns_without_keys = df_columns - set(key_names)
    df = df.dropna(axis=0, how='all', subset=df_columns_without_keys) # droping rows that have all NA values except for keys
    
    number_of_rows = len(df)
    threshold_integer = round(threshold_percent * number_of_rows)
    df = df.dropna(axis=1, thresh=threshold_integer) # droping columns that have non-NA cell count is below threshold
    return df

In [None]:
#old with regex
#function to replace ',' with '.'
#import re
def replace_commas_with_dots_in_string(single_string):
    if type(single_string) == str:
        ##df = input_dataframe.copy(deep=True)
        ##regex_input = '^[0-9]+,[0-9]+$'
        ##regex_output = '^[0-9]+\.[0-9]+$'
        ##single_string = re.sub('^[0-9]+,[0-9]+$', '^[0-9]+\.[0-9]+$', single_string)
        ##df.replace(to_replace=regex_input, value=regex_output, regex=True)
        single_string = single_string.replace(',','.')
    return single_string

In [None]:
dataset_full_clean

In [None]:
list(dataset_full_clean.columns)

In [None]:
col_names_no_key_no_target = get_col_names_without_target_and_keys(dataset_full_clean)

In [None]:
dataset_full_clean[col_names_no_key_no_target]

In [None]:
dataset_full_clean[col_names_no_key_no_target].shape

In [None]:
dataset_full_clean.shape

In [None]:
dataset_full_clean[col_names_no_key_no_target]

In [None]:
col_names_no_key_no_target

In [None]:
dataset_full_clean.dtypes

In [None]:
dataset_full_clean['v4']

In [None]:
dataset_full_clean['v12']

In [None]:
na_df = pd.DataFrame([[1.5, "2,5", 3, 4, 5], [3.5, 4.5, 5, 1, np.nan],
                    [6, "4,4", 5, "text with 4,5", "4,5 another text"], [1, 2, np.nan, np.nan, np.nan]],
                    columns=['key1','A','B','C','D'])
na_df

In [None]:
na_df = na_df.applymap(replace_commas_with_dots_in_string)
na_df

In [None]:
na_df.dtypes

In [None]:
na_df = na_df.applymap(convert_floats_in_string_to_floats)
na_df

In [None]:
na_df.dtypes

In [None]:
replace_commas_with_dots_in_df(na_df)

In [None]:
na_df.replace(",",".")

In [None]:
na_df

In [None]:
na_df.drop('B', axis=1)

In [None]:
na_df = na_df.drop(['A','B'], axis=1)
na_df

In [None]:
na_df

In [None]:
dataset_full.to_csv('output_dataset_full.csv')

In [None]:
dataset_full

In [None]:
dataset_full.dtypes

In [None]:
int(5.5)

In [None]:
dataset_full_not_cleaned

In [None]:
df_string_only = dataset_full.select_dtypes(include=['object']).copy(deep=True)

In [None]:
df_string_only.drop(['v173','v175','v177'], axis=1)

In [None]:
df_string_only = df_string_only.drop(['v173','v175','v177'], axis=1).copy(deep=True)

In [None]:
df_string_only

In [None]:
pd.get_dummies(df_string_only)

In [None]:
df1 = pd.DataFrame([['A', "B", 'YES', 4, 5], ['A', "B", 'YES', 1, np.nan],
                    ['C', "D", 'NO', 5, 4], ['C',np.nan, 'NO', np.nan, np.nan]],
                    columns=['col1','col2','col3','col4','col5'])
df1

In [None]:
pd.get_dummies(df1)

In [None]:
pd.get_dummies(df1, dummy_na=True)

In [None]:
pd.core.dtypes.common.is_datetime_or_timedelta_dtype(dates)

In [None]:
pd.core.dtypes.common.is_datetime64_ns_dtype(dates)|pd.core.dtypes.common.is_timedelta64_ns_dtype(dates)

In [None]:
pd.to_datetime(dates, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix')

In [None]:
fake_dates = dataset_full['v197'].copy(deep=True)

In [None]:
fake_dates.loc[:10]

In [None]:
#old
def get_col_names_without_target_and_keys(dataframe, key_names = key_names, target = target):
    all_column_names_set = set(dataframe)
    col_names_without_target_and_keys = all_column_names_set - set(key_names) - set([target])
    return list(col_names_without_target_and_keys)

In [None]:
import re
def check_if_date(element):
    # assumed intent: report whether a cell looks like a '%Y-%m-%d %H:%M' date string
    if type(element) == str:
        return re.match(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}$', element) is not None
    return False

In [None]:
#old

(d) Python uses '.' as a decimal point. However, in datasets sometimes we get ',' as a decimal point. Need to replace ',' with '.'. After this is done, will need to convert floats stored as string to Python floats. Integer columns will also be converted to floats.

In [None]:
def replace_commas_with_dots_in_string(single_string):
    if type(single_string) == str:
        single_string = single_string.replace(',','.')
    return single_string

In [None]:
# applying a function on each cell of a dataframe
dataset_full = dataset_full.applymap(replace_commas_with_dots_in_string)

In [None]:
#function to convert floats stored as string to floats
def convert_floats_in_string_to_floats(element):
    if type(element) == str or type(element) == int:
        try:
            return float(element)
        except (ValueError, TypeError):
            return element
    return element

We do not want to convert key and response columns to float, so need to obtain a list of all columns except for keys and target.

In [None]:
columns_to_convert = get_col_names_without_target_and_keys(dataset_full)

Converting all columns except for keys and response.

In [None]:
dataset_full.loc[:,columns_to_convert] = (dataset_full[columns_to_convert]).applymap(convert_floats_in_string_to_floats)

In [None]:
pd.to_datetime(dataset_full_copy, format='%Y-%m-%d %H:%M', errors='ignore')

In [None]:
pd.to_datetime(500, format='%Y-%m-%d %H:%M', errors='ignore')

In [None]:
pd.to_datetime(date, format='%Y-%m-%d %H:%M', errors='ignore').loc[:5]

In [None]:
date = dataset_full['v173'].copy(deep=True)
date.loc[:5]

In [None]:
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

In [None]:
#old
def col_to_datetime(input_col):
    if input_col.dtype=='O':
        col=pd.to_datetime(input_col, format='%Y-%m-%d %H:%M', errors='ignore')
        return col
    else:
        return input_col    

In [None]:
import datetime as dt

def col_to_datetime(input_col):
    if input_col.dtype=='O': #in pandas dataframe columns containing Strings, has type Object, or 'O'
        col_datetime=pd.to_datetime(input_col, format='%Y-%m-%d %H:%M', errors='ignore') #convert to datetime only if format is '%Y-%m-%d %H:%M'
        if col_datetime.dtype=='datetime64[ns]': 
            epoch_timestamp_col = col_datetime - dt.datetime(1970, 1, 1)
            sec_float_col = epoch_timestamp_col / np.timedelta64(1, 's')
            return sec_float_col
        return col_datetime
    else:
        return input_col           

In [None]:
dates.loc[:5]

In [None]:
dates2=dates.apply(col_to_datetime)
dates2.loc[:5]

In [None]:
type(dates2)

In [None]:
dates2.dtypes

In [None]:
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
dates3=dataset_full_copy.apply(col_to_datetime)
dates3.loc[:5]

In [None]:
dates3.dtypes

In [None]:
dataset_full_copy

In [None]:
dataset_full_copy.dtypes

In [None]:
dates = dataset_full[['v173','v175','v177']].copy(deep=True)
dates.loc[:5]

In [None]:
type(dates)

In [None]:
dataset_full_copy = dataset_full.copy(deep=True)

In [None]:
#old - removed because Orange does not have imputation library

(a) I did not find a way in the Pandas, NumPy and scikit-learn packages to impute missing values with machine learning algorithms (e.g. to predict the value). For that I would need to use the Orange package. But to use that package I would need to transform dataframes from Pandas to Orange.

_Note: Both Pandas and Orange dataframes are just wrappers for NumPy, so doing this transformation is not computationally expensive._

In [None]:
import Orange

# 2 functions to convert Pandas dataframe to Orange table/dataframe
def get_feature_description_for_orange_from_pandas(pandas_df):
    feature_list = [Orange.data.ContinuousVariable(col) for col in list(pandas_df.columns)]
    return Orange.data.Domain(feature_list)

def pandas_to_orange_df(pandas_df):
    np_array = pandas_df.values
    orange_table_domain = get_feature_description_for_orange_from_pandas(pandas_df)
    orange_table = Orange.data.Table(orange_table_domain, np_array)
    return orange_table

In [None]:
dataset_orange = pandas_to_orange_df(dataset_full_with_dummies)