# Exploratory Data Analysis - Firmographic Dataset

---
### <i>Changelogs:</i>

| Name | Date | Description |
|---|---|---|
| **Kiet Vu** | 03/17 | Create notebook. Minor Editing. Create "Data Understanding" section. |

---

## Table of Contents
**Each phase of the process:**
1. [Data Understanding](#Dataunderstanding)
    1. [Initial Data Report](#Datareport)
    2. [Describe Data](#Describedata)  
    3. [Verify Data Quality](#Verifydataquality)
        1. [Missing Data](#MissingData) 
        2. [Outliers](#Outliers)
    4. [Initial Data Exploration](#Exploredata)
    5. [Data Quality Report](#Dataqualityreport)
2. [Data Preparation](#Datapreparation)
    1. [Select Your Data](#Selectyourdata)
    2. [Cleanse the Data](#Cleansethedata)
        1. [Label Encoding](#labelEncoding)
        2. [Drop Unnecessary Columns](#DropCols)
        3. [Altering Data Types](#AlteringDatatypes)
        4. [Dealing With Zeros](#DealingZeros)
        5. [Dealing With Duplicates](#DealingDuplicates)
        6. [Remove Outliers](#RemoveOutliers)
    3. [Construct Required Data](#Constructrequireddata)
    4. [Integrate Data](#Integratedata)
3. [Exploratory Data Analysis](#EDA)
4. [Modelling](#Modelling)
5. [Evaluation](#Evaluation)
6. [Deployment](#Deployment)

If you want to learn more about CRISP-DM, please refer to this link: https://www.sv-europe.com/crisp-dm-methodology/

---

In [1]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
#import folium
#from folium import plugins
#!pip install --upgrade geopandas
#import geopandas

In [2]:
# Import the original dataset
df = pd.read_csv('Clean Data/firmographic_clean_20230314.csv', low_memory=False)
df = df.copy()
# ZIP codes were parsed as floats; cast them to str for now.
# Note: the values keep the '.0' suffix and lose leading zeros, so they need further cleaning.
df["B2B_ADDR_ZIP5"] = df["B2B_ADDR_ZIP5"].values.astype('str')
# Show the first 30 records
df.head(30)

Unnamed: 0,unique_identifier,B2B_ADDR_ZIP5,B2B_ADDR_STATE,B2B_ACCEPT_CREDIT_CARD_FLAG,B2B_ACCOUNTING_EXPENSE_CODE,B2B_AD_SIZE,B2B_ADVERTISING_EXPENSE_CODE,B2B_ASSET_FLAG,B2B_BANKRUPTCY_DATE,B2B_BANKRUPTCY_FLAG,...,B2B_TOT_SALES_VOLUME,B2B_TRANSACTION_CODE,B2B_TRANSACTION_TYPE,B2B_TRUE_FRNCHSE_FLAG,B2B_UTILITY_CODE,B2B_WEALTH_FLAG,B2B_WHITE_COLLAR_FLAG,B2B_WHITE_COLLAR_PRCNT,B2B_YEAR_SIC_ADD,CAC_SEGMENT
0,0001230a214b39e0e5c463bfe440fb15,44240.0,OH,,C,A,A,,,,...,,,,2.0,D,,,25.0,201801.0,Manufacturing
1,000345e997e72b61b990d2689c76427f,15218.0,PA,,D,A,C,,,,...,,A,,,B,,1.0,99.0,198405.0,Business and Finance
2,0003c4d7aeb24f319f0d7c6ddb60bb8f,44067.0,OH,ADMV,C,B,B,,,,...,,,,2.0,A,1.0,,16.0,201909.0,Personal Services
3,00082675e86a9f3cf5fdcc5d4cd9114d,60618.0,IL,,D,,B,,,,...,,,,2.0,C,,,21.0,199104.0,Blue Collar Work
4,00095201031df44962513f378842d521,61111.0,IL,DMV,A,A,A,,,,...,,,,2.0,A,,1.0,83.0,201303.0,General Merchandise
5,000a04481ee5acbb856a7c485a67423a,62526.0,IL,,E,A,D,,,,...,,,,2.0,D,,1.0,50.0,201902.0,Personal Services
6,000a1fe8f9d0caf306b805de359b6947,14472.0,NY,,G,B,F,,,,...,,,,2.0,D,,1.0,54.0,202208.0,Wholesale
7,000bee0b537b676a975a15999776581f,2110.0,MA,,C,B,C,,,,...,,,,1.0,A,,,10.0,202011.0,Food and Dining
8,000c88d34beda722f7b559bb056b7809,78064.0,TX,ADMV,A,A,A,,,,...,,,,2.0,A,,,29.0,201303.0,Hotels and Educational Boarding
9,000cc270c1cc3f09a4a80c2489ce4bac,8401.0,NJ,ADMV,E,A,E,,,,...,,,,2.0,B,,,10.0,200309.0,Food and Dining
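
The preview above shows ZIP codes such as `2110.0` and `8401.0`, which suggests the column was parsed as a float and lost its leading zeros. One possible way to repair this is sketched below. It assumes the values are whole numbers stored as floats; the helper name `normalize_zip5` and the sample values are illustrative and not part of the original notebook.

```python
import pandas as pd


def normalize_zip5(series: pd.Series) -> pd.Series:
    """Return zero-padded 5-character ZIP strings, keeping missing values as <NA>."""
    # Coerce anything non-numeric to NaN so it ends up as a missing value.
    numeric = pd.to_numeric(series, errors="coerce")
    # The nullable Int64 dtype drops the ".0" suffix while preserving missing values.
    as_int = numeric.round().astype("Int64")
    # Convert to pandas' string dtype and left-pad with zeros to five characters.
    return as_int.astype("string").str.zfill(5)


# Hypothetical sample mirroring values seen in the preview above.
sample = pd.Series([44240.0, 2110.0, 8401.0, None])
zips = normalize_zip5(sample)
print(zips.tolist())
```

Applied to the real frame, this would read `df["B2B_ADDR_ZIP5"] = normalize_zip5(df["B2B_ADDR_ZIP5"])`, run on the column as loaded from the CSV rather than after the string cast.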


---
## 1. Data Understanding <a class="anchor" id="Dataunderstanding"></a>

### 1.2 Describe Data <a class="anchor" id="Describedata"></a>

In [3]:
df.dtypes

unique_identifier               object
B2B_ADDR_ZIP5                   object
B2B_ADDR_STATE                  object
B2B_ACCEPT_CREDIT_CARD_FLAG     object
B2B_ACCOUNTING_EXPENSE_CODE     object
                                ...   
B2B_WEALTH_FLAG                float64
B2B_WHITE_COLLAR_FLAG          float64
B2B_WHITE_COLLAR_PRCNT         float64
B2B_YEAR_SIC_ADD               float64
CAC_SEGMENT                     object
Length: 101, dtype: object
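
With 101 columns, the truncated `dtypes` listing is hard to scan. A compact summary by dtype may be easier to read. The sketch below uses a small stand-in frame (`toy`, with made-up rows but column names taken from the dataset) rather than the real `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real frame; only the column names come from the dataset.
toy = pd.DataFrame({
    "B2B_ADDR_STATE": ["OH", "PA"],
    "B2B_WHITE_COLLAR_PRCNT": [25.0, 99.0],
    "B2B_YEAR_SIC_ADD": [201801.0, 198405.0],
})

# Number of columns per dtype.
dtype_counts = toy.dtypes.value_counts()

# Split the column names into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(numeric_cols)
print(non_numeric_cols)
```

Replacing `toy` with `df` should give the same kind of summary for the full dataset.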

In [4]:
df.columns

Index(['unique_identifier', 'B2B_ADDR_ZIP5', 'B2B_ADDR_STATE',
       'B2B_ACCEPT_CREDIT_CARD_FLAG', 'B2B_ACCOUNTING_EXPENSE_CODE',
       'B2B_AD_SIZE', 'B2B_ADVERTISING_EXPENSE_CODE', 'B2B_ASSET_FLAG',
       'B2B_BANKRUPTCY_DATE', 'B2B_BANKRUPTCY_FLAG',
       ...
       'B2B_TOT_SALES_VOLUME', 'B2B_TRANSACTION_CODE', 'B2B_TRANSACTION_TYPE',
       'B2B_TRUE_FRNCHSE_FLAG', 'B2B_UTILITY_CODE', 'B2B_WEALTH_FLAG',
       'B2B_WHITE_COLLAR_FLAG', 'B2B_WHITE_COLLAR_PRCNT', 'B2B_YEAR_SIC_ADD',
       'CAC_SEGMENT'],
      dtype='object', length=101)

In [5]:
df.size

6048486

In [6]:
df.shape

(59886, 101)
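
As a quick consistency check, `df.size` should equal the product of the two values in `df.shape`. The sketch below illustrates the relationship on a small made-up frame (`toy` is an assumption, not the real data) and repeats the arithmetic for the reported shape.

```python
import pandas as pd

# Hypothetical 3 x 2 frame used only to illustrate the relationship.
toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

n_rows, n_cols = toy.shape
total_cells = toy.size  # number of cells = rows * columns

# The same arithmetic for the shape reported above.
reported_cells = 59886 * 101

print(n_rows * n_cols == total_cells)
print(reported_cells)
```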

In [7]:
df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59886 entries, 0 to 59885
Data columns (total 101 columns):
 #    Column                           Dtype  
---   ------                           -----  
 0    unique_identifier                object 
 1    B2B_ADDR_ZIP5                    object 
 2    B2B_ADDR_STATE                   object 
 3    B2B_ACCEPT_CREDIT_CARD_FLAG      object 
 4    B2B_ACCOUNTING_EXPENSE_CODE      object 
 5    B2B_AD_SIZE                      object 
 6    B2B_ADVERTISING_EXPENSE_CODE     object 
 7    B2B_ASSET_FLAG                   float64
 8    B2B_BANKRUPTCY_DATE              object 
 9    B2B_BANKRUPTCY_FLAG              object 
 10   B2B_BIG_BUSINESS_INDICATOR       object 
 11   B2B_BUSINESS_DESCRIP_FLAG        object 
 12   B2B_BUSINESS_GROW_FLAG           object 
 13   B2B_BUSINESS_SQUARE_FOOT_NUM     object 
 14   B2B_BUSINESS_STATUS_CODE         float64
 15   B2B_CALL_STATUS                  object 
 16   B2B_COMPUTER_EXPENSE_CODE        objec