### Contents

* [Reading dataset](#Reading-dataset)
    * [Inspecting the rows of data](#Inspecting-the-rows-of-data)
    * [Number of rows and columns](#Number-of-rows-and-columns)
    * [Dataset Info](#Dataset-Info)
* [Cleaning phase](#Cleaning-phase)
    * [Converting column names to lower case](#Converting-column-names-to-lower-case)
    * [Fixing data type of columns](#Fixing-data-type-of-columns)
    * [Unique value count](#Unique-value-count)
    * [Discarding unnecessary columns](#Discarding-unnecessary-columns)
* [Dumping cleaned data](#Dumping-cleaned-data)

In [1]:
from utils import *

import os
import pandas as pd

pd.set_option('display.max_columns', 50)

## Reading dataset

In [2]:
df_train = pd.read_csv('data/source/train.csv')

### Inspecting the rows of data

In [3]:
display(df_train.head())

Unnamed: 0,Deal_title,Lead_name,Industry,Deal_value,Weighted_amount,Date_of_creation,Pitch,Contact_no,Lead_revenue,Fund_category,Geography,Location,POC_name,Designation,Lead_POC_email,Hiring_candidate_role,Lead_source,Level_of_meeting,Last_lead_update,Internal_POC,Resource,Internal_rating,Success_probability
0,TitleM5DZY,"Davis, Perkins and Bishop Inc",Restaurants,320506$,2067263.7$,2020-03-29,Product_2,607.447.7883,50 - 100 Million,Category 2,USA,"Killeen-Temple, TX",Charlene Werner,Executive Vice President,charlenewerner@davis.com,Community pharmacist,Website,Level 3,No track,"Davis,Sharrice A",,3,73.6
1,TitleKIW18,Bender PLC LLC,Construction Services,39488$,240876.8$,2019-07-10,Product_2,892-938-9493,500 Million - 1 Billion,Category 4,India,Ratlam,rakhi,Chairman/CEO/President,terrylogan@bender.com,Recruitment consultant,Others,Level 1,Did not hear back after Level 1,"Brown,Maxine A",No,5,58.9
2,TitleFXSDN,Carter-Henry and Sons,Hospitals/Clinics,359392$,2407926.4$,2019-07-27,Product_1,538.748.2271,500 Million - 1 Billion,Category 4,USA,"Albany-Schenectady-Troy, NY",Ariel Hamilton,SVP/General Counsel,arielhamilton@carterhenry.com,Health service manager,Marketing Event,Level 1,?,"Georgakopoulos,Vasilios T",No,4,68.8
3,TitlePSK4Y,Garcia Ltd Ltd,Real Estate,76774$,468321.4$,2021-01-30,Product_2,(692)052-1389x75188,500 Million - 1 Billion,Category 3,USA,"Mount Vernon-Anacortes, WA",Erin Wilson,CEO/Co-Founder/Chairman,erinwilson@garcia.com,"Therapist, speech and language",Contact Email,Level 2,Did not hear back after Level 1,"Brown,Maxine A",We have all the requirements,1,64.5
4,Title904GV,Lee and Sons PLC,Financial Services,483896$,,2019-05-22,Product_2,001-878-814-6134x015,50 - 100 Million,Category 3,India,Shimoga,kavita,Executive Vice President,mr.christopher@lee.com,Media planner,Website,Level 2,Up-to-date,"Thomas,Lori E",No,4,62.4


### Number of rows and columns

In [4]:
df_train.shape

(7007, 23)

### Dataset Info

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7007 entries, 0 to 7006
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Deal_title             7007 non-null   object 
 1   Lead_name              7007 non-null   object 
 2   Industry               7006 non-null   object 
 3   Deal_value             6956 non-null   object 
 4   Weighted_amount        6482 non-null   object 
 5   Date_of_creation       7007 non-null   object 
 6   Pitch                  7007 non-null   object 
 7   Contact_no             7007 non-null   object 
 8   Lead_revenue           7007 non-null   object 
 9   Fund_category          7007 non-null   object 
 10  Geography              6035 non-null   object 
 11  Location               6996 non-null   object 
 12  POC_name               6999 non-null   object 
 13  Designation            7007 non-null   object 
 14  Lead_POC_email         7007 non-null   object 
 15  Hiri

#### Observation:

* Column names should be converted to lower case so that they are consistent and quicker to type during analysis.
* `Deal_value`, `Weighted_amount` and `Date_of_creation` are stored as `object` (strings). The first two should be numeric (their values carry a trailing `$`), and the third should be a datetime.
* Some columns have missing values. These will be inspected in the EDA phase.

## Cleaning phase

### Converting column names to lower case

In [6]:
df_train.columns = df_train.columns.str.lower()

### Fixing data type of columns

In [7]:
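# Amounts are stored as strings such as '320506$'; drop the '$' before casting to float
df_train['deal_value'] = df_train['deal_value'].str.strip('$').astype(float)
df_train['weighted_amount'] = df_train['weighted_amount'].str.strip('$').astype(float)
# Parse date strings such as '2020-03-29' into datetime64 values
df_train['date_of_creation'] = pd.to_datetime(df_train['date_of_creation'])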
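
The conversion above assumes every non-missing amount is a number with a `$` attached. A minimal sketch of a more defensive variant follows, assuming malformed entries should become `NaN` rather than raise an error. The helper name `to_amount` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd


def to_amount(series: pd.Series) -> pd.Series:
    """Strip whitespace and a trailing '$', then convert to float.

    Values that cannot be parsed become NaN (assumption: silently coercing
    bad entries is acceptable at this stage).
    """
    cleaned = series.str.strip().str.rstrip("$")
    return pd.to_numeric(cleaned, errors="coerce").astype(float)


# Illustrative input: two valid amounts, one missing value, one malformed entry
sample = pd.Series(["320506$", "2067263.7$", None, "n/a"])
amounts = to_amount(sample)
print(amounts)
```

Comparing `amounts.isna().sum()` with the number of missing values before conversion is one way to see whether any entries were coerced.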

### Unique value count

In [8]:
df_train_nunique = unique_value_stats(df_train)

display(df_train_nunique)

Unnamed: 0,column,unique_value_count
0,deal_title,7007
1,lead_name,7007
2,industry,171
3,deal_value,6907
4,weighted_amount,6480
5,date_of_creation,777
6,pitch,2
7,contact_no,7007
8,lead_revenue,3
9,fund_category,4
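
`unique_value_stats` comes from the project's `utils` module, which is not shown here. Judging only from the output above, it might look roughly like the sketch below. The name `unique_value_stats_sketch` is hypothetical, and excluding missing values from the count is an assumption.

```python
import pandas as pd


def unique_value_stats_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its number of distinct values."""
    # DataFrame.nunique() ignores NaN by default (dropna=True)
    counts = df.nunique()
    return pd.DataFrame(
        {"column": counts.index, "unique_value_count": counts.to_numpy()}
    )


# Small illustrative frame, not taken from the dataset
demo = pd.DataFrame(
    {
        "pitch": ["Product_1", "Product_2", "Product_2"],
        "deal_value": [1.0, 2.0, None],
    }
)
stats = unique_value_stats_sketch(demo)
print(stats)
```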


### Discarding unnecessary columns

In [9]:
cols_to_discard = discard_columns(df_train, df_train_nunique)

cols_to_discard

['deal_title', 'lead_name', 'contact_no', 'lead_poc_email']
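
`discard_columns` is also defined in `utils` and its rule is not shown. The visible counts are consistent with dropping identifier-like columns whose unique value count equals the number of rows (7007 for `deal_title`, `lead_name` and `contact_no`; the count for `lead_poc_email` is not visible in the truncated output). The sketch below implements that assumed rule; the name `discard_columns_sketch` is hypothetical.

```python
import pandas as pd


def discard_columns_sketch(df: pd.DataFrame, nunique_stats: pd.DataFrame) -> list:
    """List columns whose unique value count equals the number of rows.

    Assumption: such columns behave like identifiers and carry no
    generalisable signal.
    """
    n_rows = len(df)
    is_identifier = nunique_stats["unique_value_count"] == n_rows
    return nunique_stats.loc[is_identifier, "column"].tolist()


# Small illustrative frame, not taken from the dataset
demo = pd.DataFrame(
    {"deal_title": ["A", "B", "C"], "pitch": ["P1", "P2", "P2"]}
)
demo_stats = pd.DataFrame(
    {"column": ["deal_title", "pitch"], "unique_value_count": [3, 2]}
)
to_drop = discard_columns_sketch(demo, demo_stats)
print(to_drop)
```

A rule like this would also flag a continuous numeric column that happens to have no repeated values, so restricting it to `object` columns may be worth considering.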

In [10]:
df_train = df_train.drop(columns=cols_to_discard)
    
display(df_train.head())

Unnamed: 0,industry,deal_value,weighted_amount,date_of_creation,pitch,lead_revenue,fund_category,geography,location,poc_name,designation,hiring_candidate_role,lead_source,level_of_meeting,last_lead_update,internal_poc,resource,internal_rating,success_probability
0,Restaurants,320506.0,2067263.7,2020-03-29,Product_2,50 - 100 Million,Category 2,USA,"Killeen-Temple, TX",Charlene Werner,Executive Vice President,Community pharmacist,Website,Level 3,No track,"Davis,Sharrice A",,3,73.6
1,Construction Services,39488.0,240876.8,2019-07-10,Product_2,500 Million - 1 Billion,Category 4,India,Ratlam,rakhi,Chairman/CEO/President,Recruitment consultant,Others,Level 1,Did not hear back after Level 1,"Brown,Maxine A",No,5,58.9
2,Hospitals/Clinics,359392.0,2407926.4,2019-07-27,Product_1,500 Million - 1 Billion,Category 4,USA,"Albany-Schenectady-Troy, NY",Ariel Hamilton,SVP/General Counsel,Health service manager,Marketing Event,Level 1,?,"Georgakopoulos,Vasilios T",No,4,68.8
3,Real Estate,76774.0,468321.4,2021-01-30,Product_2,500 Million - 1 Billion,Category 3,USA,"Mount Vernon-Anacortes, WA",Erin Wilson,CEO/Co-Founder/Chairman,"Therapist, speech and language",Contact Email,Level 2,Did not hear back after Level 1,"Brown,Maxine A",We have all the requirements,1,64.5
4,Financial Services,483896.0,,2019-05-22,Product_2,50 - 100 Million,Category 3,India,Shimoga,kavita,Executive Vice President,Media planner,Website,Level 2,Up-to-date,"Thomas,Lori E",No,4,62.4


## Dumping cleaned data

In [11]:
df_train.to_csv('data/cleaned/train.csv', index=False)
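
`to_csv` does not create missing directories, so the call above fails if `data/cleaned` does not exist yet. A minimal sketch of guarding against that with `os.makedirs` follows. The temporary directory and the one-row frame are stand-ins for the project root and `df_train`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned training data
df_demo = pd.DataFrame({"industry": ["Restaurants"], "deal_value": [320506.0]})

# A temporary directory plays the role of the project root in this sketch
with tempfile.TemporaryDirectory() as root:
    out_path = os.path.join(root, "data", "cleaned", "train.csv")
    # Create the target directory if needed; exist_ok avoids an error on reruns
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    df_demo.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)

print(round_trip)
```

CSV files do not store data types, so `date_of_creation` is read back as strings unless `parse_dates=['date_of_creation']` is passed to `pd.read_csv`.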