In [1]:
from utils import *
import pandas as pd

pd.set_option('display.max_columns', 50)

## Reading dataset

In [2]:
df_train = pd.read_csv('data/cleaned/train.csv', 
                       parse_dates=['date_of_creation'])

display(df_train.head())

Unnamed: 0,industry,deal_value,weighted_amount,date_of_creation,pitch,lead_revenue,fund_category,geography,location,poc_name,designation,hiring_candidate_role,lead_source,level_of_meeting,last_lead_update,internal_poc,resource,internal_rating,success_probability
0,Restaurants,320506.0,2067263.7,2020-03-29,Product_2,50 - 100 Million,Category 2,USA,"Killeen-Temple, TX",Charlene Werner,Executive Vice President,Community pharmacist,Website,Level 3,No track,"Davis,Sharrice A",,3,73.6
1,Construction Services,39488.0,240876.8,2019-07-10,Product_2,500 Million - 1 Billion,Category 4,India,Ratlam,rakhi,Chairman/CEO/President,Recruitment consultant,Others,Level 1,Did not hear back after Level 1,"Brown,Maxine A",No,5,58.9
2,Hospitals/Clinics,359392.0,2407926.4,2019-07-27,Product_1,500 Million - 1 Billion,Category 4,USA,"Albany-Schenectady-Troy, NY",Ariel Hamilton,SVP/General Counsel,Health service manager,Marketing Event,Level 1,?,"Georgakopoulos,Vasilios T",No,4,68.8
3,Real Estate,76774.0,468321.4,2021-01-30,Product_2,500 Million - 1 Billion,Category 3,USA,"Mount Vernon-Anacortes, WA",Erin Wilson,CEO/Co-Founder/Chairman,"Therapist, speech and language",Contact Email,Level 2,Did not hear back after Level 1,"Brown,Maxine A",We have all the requirements,1,64.5
4,Financial Services,483896.0,,2019-05-22,Product_2,50 - 100 Million,Category 3,India,Shimoga,kavita,Executive Vice President,Media planner,Website,Level 2,Up-to-date,"Thomas,Lori E",No,4,62.4


### Sorting data by `date_of_creation`

In [3]:
# Sort chronologically and rebuild a clean 0..n-1 index in one chained step
df_train = df_train.sort_values(by='date_of_creation').reset_index(drop=True)
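
A quick way to confirm that the sort worked is to check that the date column is monotonically increasing. The sketch below uses a small made-up frame (`toy`, not the real training data) and also passes `kind='stable'`, which is an optional addition not used in the cell above; it keeps rows with the same date in their original relative order.

```python
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    'date_of_creation': pd.to_datetime(
        ['2020-03-29', '2019-07-10', '2019-07-10', '2021-01-30']
    ),
    'deal_value': [320506.0, 39488.0, 76774.0, 483896.0],
})

# kind='stable' preserves the original order of rows that share a date.
toy_sorted = (
    toy.sort_values(by='date_of_creation', kind='stable')
       .reset_index(drop=True)
)

# True when every date is >= the one before it.
is_sorted = toy_sorted['date_of_creation'].is_monotonic_increasing
print(is_sorted)
```

The same `is_monotonic_increasing` check could be run on `df_train['date_of_creation']` directly.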

#### Inspecting head and tail of the sorted data

In [4]:
df_train.head()

Unnamed: 0,industry,deal_value,weighted_amount,date_of_creation,pitch,lead_revenue,fund_category,geography,location,poc_name,designation,hiring_candidate_role,lead_source,level_of_meeting,last_lead_update,internal_poc,resource,internal_rating,success_probability
0,Technology Consulting,326636.0,1959816.0,2019-01-01,Product_1,100 - 500 Million,Category 3,India,Brahmapur,gaytri bai,Chairman/Chief Innovation Officer,"Editor, commissioning",Others,Level 2,more than a month,"Dyson,William A",Deliverable,5,62.7
1,Other,64710.0,423850.5,2019-01-01,Product_1,50 - 100 Million,Category 4,India,Kavaratti,kamini,SVP/General Counsel,Community development worker,Marketing Event,Level 2,Pending,"Massiah,Gerard F",We have all the requirements,5,61.7
2,Hospitals/Clinics,170420.0,1073646.0,2019-01-01,Product_1,500 Million - 1 Billion,Category 1,India,Muzaffarpur,puja,Vice President / GM (04-present) : VP Sales an...,Hydrologist,Others,Level 2,Pending,"Sutton,Michelle R",No,3,107.34
3,Banks,480392.0,2906371.6,2019-01-01,Product_1,500 Million - 1 Billion,Category 3,India,Kavaratti,poonam,CEO/Co-Founder/Chairman,Office manager,Others,Level 3,More than a week back,"Van Arter,Derrick",No,1,61.5
4,Banks,261734.0,1583490.7,2019-01-01,Product_1,50 - 100 Million,Category 3,India,Mahabubnagar,alisha loomba,Chairman/Chief Innovation Officer,Manufacturing engineer,Marketing Event,Level 1,Did not hear back after Level 1,"Charles,Caleb",No,5,26.35


In [5]:
df_train.tail()

Unnamed: 0,industry,deal_value,weighted_amount,date_of_creation,pitch,lead_revenue,fund_category,geography,location,poc_name,designation,hiring_candidate_role,lead_source,level_of_meeting,last_lead_update,internal_poc,resource,internal_rating,success_probability
7002,Materials/Manufacturing,45251.0,257930.7,2021-02-15,Product_1,50 - 100 Million,Category 2,India,Cuttack,khushboo khan,CEO/Co-Founder/Chairman,Medical technical officer,Others,Level 3,Pending,"Gaskins Jr,Franklin D",Yes,4,61.3
7003,Financial Services,52662.0,284374.8,2021-02-15,Product_1,100 - 500 Million,Category 2,USA,"Greenville-Anderson-Mauldin, SC",Chad Brown,SVP/General Counsel,English as a second language teacher,Others,Level 1,Did not hear back after Level 1,"Green,Candy",Yes,1,62.2
7004,Restaurants,326727.0,2025707.4,2021-02-15,Product_2,50 - 100 Million,Category 3,India,Tezpur,ambika,CEO,"Radiographer, diagnostic",Contact Email,Level 3,More than a week back,"Heidelberg,Andre D",Deliverable,4,61.5
7005,Conglomerates,487253.0,3142781.85,2021-02-15,Product_1,500 Million - 1 Billion,Category 3,India,Vishakhapatnam,deeksha,Chairman/Chief Innovation Officer,Commercial/residential surveyor,Marketing Event,Level 3,More than a week back,"Ryker,David",Not enough,1,63.5
7006,Recreational Products,167888.0,990539.2,2021-02-15,Product_1,500 Million - 1 Billion,Category 3,,Ajmer,rakhi soni,CEO,Bonds trader,Others,Level 2,More than 2 weeks,"Heidelberg,Andre D",Cannot deliver,4,62.0


## Unique value stats for summarizing the data

In [6]:
df_train_nunique = unique_value_stats(df_train)

display(df_train_nunique)

Unnamed: 0,column,unique_value_count
0,industry,171
1,deal_value,6907
2,weighted_amount,6480
3,date_of_creation,777
4,pitch,2
5,lead_revenue,3
6,fund_category,4
7,geography,2
8,location,597
9,poc_name,5261
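
`unique_value_stats` comes from the project's `utils` module, whose source is not shown here. A minimal sketch of a helper that would produce a table of the same shape, assuming it simply wraps `DataFrame.nunique`, might look like this (the name `unique_value_stats_sketch` and the toy data are hypothetical):

```python
import pandas as pd


def unique_value_stats_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.unique_value_stats.

    Returns one row per column with the number of distinct non-null values.
    """
    counts = df.nunique()  # NaN is excluded by default (dropna=True)
    return counts.rename_axis('column').reset_index(name='unique_value_count')


# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    'pitch': ['Product_1', 'Product_2', 'Product_1'],
    'geography': ['USA', None, 'India'],
})

stats = unique_value_stats_sketch(toy)
print(stats)
```

Because `nunique` skips missing values by default, a column such as `geography` can report two unique values even when some rows are empty.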


### Summary of the training dataset

* The data covers leads created between 01 Jan 2019 and 15 Feb 2021


* There are 171 industries in the data


* Leads were generated for two products: `Product_1` and `Product_2`


* The revenue of the lead's organization falls into one of the following ranges:

```python
['50 - 100 Million', '100 - 500 Million', '500 Million - 1 Billion']
```
    
* The funding category is one of `['Category 1', 'Category 2', 'Category 3', 'Category 4']`


* Leads come from two countries: `USA` and `India`


* The `designation` column has the following values:

```python
['Chairman/Chief Innovation Officer', 'SVP/General Counsel ',
 'Vice President / GM (04-present) : VP Sales and Marketing (01-04)',
 'CEO/Co-Founder/Chairman', 'CEO/Chairman/President',
 'Chairman/CEO/President', 'CEO', 'CEO/President',
 'Chief Executive Officer', 'Executive Vice President'
]
```

The `designation` column contains ambiguous values: for example, `CEO` and `Chief Executive Officer` describe the same role, and `'SVP/General Counsel '` carries a trailing space.


* The lead can come from the following sources:

```python
['Contact Email', 'Website', 'Marketing Event', 'Others']
```

* The level of meeting is defined [here](https://github.com/sank3t/Reduce-Marketing-Spend#call-level)


* The last lead update (`last_lead_update`) has the following values:

```python
['more than a month',
 'Pending',
 'More than a week back',
 'Did not hear back after Level 1',
 '?',
 'Up-to-date',
 '2 days back',
 '5 days back',
 'No track',
 'Following up but lead not responding',
 'More than 2 weeks'
]
```

* The availability of resources can have the following values:

```python
 ['Deliverable', 'We have all the requirements', 
  'No', 'Not enough',
  'Cannot deliver', 'Yes'
 ]
```

* The lead is rated on a scale of **1 to 5**
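
The summary above can be reproduced with a few pandas calls. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative, not the real data): it reads the date range from `min`/`max`, lists the distinct values of the categorical columns, and flags strings that differ from their stripped form, which is how padding like the trailing space in `'SVP/General Counsel '` can be spotted.

```python
import pandas as pd

# Toy frame standing in for df_train; values are made up for illustration.
toy = pd.DataFrame({
    'date_of_creation': pd.to_datetime(['2019-01-01', '2020-03-29', '2021-02-15']),
    'designation': ['CEO', 'SVP/General Counsel ', 'Chief Executive Officer'],
    'lead_source': ['Website', 'Others', 'Website'],
})

# Earliest and latest creation dates.
date_range = (toy['date_of_creation'].min(), toy['date_of_creation'].max())

# Distinct non-null values per categorical column, in order of first appearance.
unique_values = {
    col: toy[col].dropna().unique().tolist()
    for col in ['designation', 'lead_source']
}

# Values that change when stripped have leading or trailing whitespace.
has_padding = toy['designation'].str.strip().ne(toy['designation'])
padded = toy.loc[has_padding, 'designation'].tolist()

print(date_range)
print(unique_values)
print([repr(v) for v in padded])
```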

#### Interesting fact 🙂

The leads from India have their `poc_name` in lower case.
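
One way to check this observation is to compute, per country, the share of names that are entirely lower case. This is a sketch on a handful of made-up rows modelled on the samples shown earlier, not a result from the full dataset:

```python
import pandas as pd

# Toy rows modelled on the displayed samples; not the full dataset.
toy = pd.DataFrame({
    'geography': ['India', 'India', 'USA', 'USA'],
    'poc_name': ['rakhi', 'kavita', 'Charlene Werner', 'Erin Wilson'],
})

# str.islower() is True when all cased characters in the string are lower case.
lowercase_share = (
    toy['poc_name'].str.islower()
       .groupby(toy['geography'])
       .mean()
)
print(lowercase_share)
```

A share of 1.0 for a country means every name in that group is lower case.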

## Data Preparation

### Fixing the ambiguous values in `designation` column

In [7]:
replace_designation_dict = {
    'Chairman/Chief Innovation Officer': 'CEO',
    'CEO/Co-Founder/Chairman': 'CEO',
    'CEO/Chairman/President': 'CEO',
    'Chairman/CEO/President': 'CEO',
    'CEO/President': 'CEO',
    'Chief Executive Officer': 'CEO',
    'Vice President / GM (04-present) : VP Sales and Marketing (01-04)': 'SVP/General Counsel',
    'Executive Vice President': 'SVP/General Counsel'
}

df_train = df_train.replace({'designation': replace_designation_dict})
df_train['designation'] = df_train['designation'].str.strip()
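
After a consolidation like this, `value_counts` gives a quick view of which labels remain. The sketch below works on a made-up Series with a shortened, hypothetical `mapping`. Note that it strips whitespace before replacing, which is a variation on the cell above: that order lets padded values still match the dictionary keys.

```python
import pandas as pd

# Toy values; 'SVP/General Counsel ' keeps the trailing space seen in the data.
designation = pd.Series([
    'CEO/President',
    'SVP/General Counsel ',
    'Executive Vice President',
    'CEO',
])

# Shortened, illustrative version of replace_designation_dict.
mapping = {
    'CEO/President': 'CEO',
    'Executive Vice President': 'SVP/General Counsel',
}

# Strip first so padded strings can match the mapping keys, then replace.
cleaned = designation.str.strip().replace(mapping)

counts = cleaned.value_counts()
print(counts)
```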

### Fixing the ambiguous values in `resource` column

In [8]:
replace_resource_dict = {
    'Deliverable': 'Yes',
    'We have all the requirements': 'Yes',
    'Not enough': 'No',
    'Cannot deliver': 'No'
}

df_train = df_train.replace({'resource': replace_resource_dict})
df_train['resource'] = df_train['resource'].str.strip()
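
The displayed rows show that `resource` can be empty, so it is worth confirming that the cleaned column contains only `Yes`, `No`, or missing values. A minimal sketch on made-up data, with `mapping` mirroring `replace_resource_dict`:

```python
import pandas as pd

# Toy values, including one missing entry.
resource = pd.Series(['Deliverable', None, 'Cannot deliver', 'Yes', 'No'])

# Same mapping as replace_resource_dict above.
mapping = {
    'Deliverable': 'Yes',
    'We have all the requirements': 'Yes',
    'Not enough': 'No',
    'Cannot deliver': 'No',
}

cleaned = resource.replace(mapping).str.strip()

# Any label other than Yes/No would show up here.
leftover = set(cleaned.dropna()) - {'Yes', 'No'}
# Missing values pass through replace and strip untouched.
n_missing = int(cleaned.isna().sum())

print(leftover, n_missing)
```

An empty `leftover` set means every non-missing value was mapped to one of the two target labels.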

In [9]:
df_train.to_csv('data/transformed/train.csv', index=False)
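
CSV does not store dtypes, so a later notebook that loads the transformed file needs to pass `parse_dates` again to get datetime values back. The sketch below illustrates the round trip with an in-memory buffer and made-up data instead of the real file:

```python
import io

import pandas as pd

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    'date_of_creation': pd.to_datetime(['2019-01-01', '2021-02-15']),
    'resource': ['Yes', None],
})

# Write without the index, as in the cell above, but to memory instead of disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Dates come back as strings unless parse_dates is given again.
reloaded = pd.read_csv(buffer, parse_dates=['date_of_creation'])
print(reloaded.dtypes)
```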