# Exploratory Data Analysis - Usage Dataset

---
### <i>Changelog:</i>

| Name | Date | Description |
|---|---|---|
| **Kiet Vu** | 03/17 | Created notebook. Minor editing. Created "Data Understanding" section. |

---

## Table of Contents
**Each phase of the process:**
1. [Data Understanding](#Dataunderstanding)
    1. [Initial Data Report](#Datareport)
    2. [Describe Data](#Describedata)
    3. [Verify Data Quality](#Verifydataquality)
        1. [Missing Data](#MissingData) 
        2. [Outliers](#Outliers)
    4. [Initial Data Exploration](#Exploredata)
    5. [Data Quality Report](#Dataqualityreport)
2. [Data Preparation](#Datapreparation)
    1. [Select Your Data](#Selectyourdata)
    2. [Cleanse the Data](#Cleansethedata)
        1. [Label Encoding](#labelEncoding)
        2. [Drop Unnecessary Columns](#DropCols)
        3. [Altering Data Types](#AlteringDatatypes)
        4. [Dealing With Zeros](#DealingZeros)
        5. [Dealing With Duplicates](#DealingDuplicates)
        6. [Remove Outliers](#RemoveOutliers)
    3. [Construct Required Data](#Constructrequireddata)
    4. [Integrate Data](#Integratedata)
3. [Exploratory Data Analysis](#EDA)
4. [Modelling](#Modelling)
5. [Evaluation](#Evaluation)
6. [Deployment](#Deployment)

This notebook is organized around the CRISP-DM process. To learn more about CRISP-DM, see https://www.sv-europe.com/crisp-dm-methodology/

In [1]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
#import folium
#from folium import plugins
#!pip install --upgrade geopandas
#import geopandas

In [2]:
# Load the usage dataset from the cleaned CSV export
df = pd.read_csv('Clean Data/usage_clean_20230314.csv', encoding_errors='ignore')
# Preview the first 30 records
df.head(30)

| | unique_identifier | usage | status |
|---|---|---|---|
| 0 | 0001230a214b39e0e5c463bfe440fb15 | 81440.0 | FINALLED |
| 1 | 000345e997e72b61b990d2689c76427f | 556.3 | ACTIVE |
| 2 | 0003c4d7aeb24f319f0d7c6ddb60bb8f | 32564.0 | FINALLED |
| 3 | 00082675e86a9f3cf5fdcc5d4cd9114d | 5519.0 | FINALLED |
| 4 | 00095201031df44962513f378842d521 | 5946.0 | ACTIVE |
| 5 | 000a04481ee5acbb856a7c485a67423a | 75468.0 | FINALLED |
| 6 | 000bee0b537b676a975a15999776581f | 88280.0 | FINALLED |
| 7 | 000c88d34beda722f7b559bb056b7809 | 109258.0 | ACTIVE |
| 8 | 000f645a52095f72ec723133e2b0092c | 9686.0 | FINALLED |
| 9 | 00109796f3c34d87f1ff2778498a8016 | 65700.0 | FINALLED |
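
A natural follow-up to the preview is to check how the `status` values are distributed and whether `unique_identifier` is really unique. The sketch below is illustrative only: it rebuilds the ten previewed rows as a stand-in frame called `sample` (a hypothetical name), so the counts describe the preview and not the full dataset.

```python
import pandas as pd

# Stand-in frame built from the ten previewed rows (not the full dataset).
sample = pd.DataFrame({
    "unique_identifier": [
        "0001230a214b39e0e5c463bfe440fb15",
        "000345e997e72b61b990d2689c76427f",
        "0003c4d7aeb24f319f0d7c6ddb60bb8f",
        "00082675e86a9f3cf5fdcc5d4cd9114d",
        "00095201031df44962513f378842d521",
        "000a04481ee5acbb856a7c485a67423a",
        "000bee0b537b676a975a15999776581f",
        "000c88d34beda722f7b559bb056b7809",
        "000f645a52095f72ec723133e2b0092c",
        "00109796f3c34d87f1ff2778498a8016",
    ],
    "usage": [81440.0, 556.3, 32564.0, 5519.0, 5946.0,
              75468.0, 88280.0, 109258.0, 9686.0, 65700.0],
    "status": ["FINALLED", "ACTIVE", "FINALLED", "FINALLED", "ACTIVE",
               "FINALLED", "FINALLED", "ACTIVE", "FINALLED", "FINALLED"],
})

# How many records fall under each status label?
status_counts = sample["status"].value_counts()
print(status_counts)

# True when no identifier appears more than once.
ids_are_unique = sample["unique_identifier"].is_unique
print(ids_are_unique)
```

On the real data the same two calls would be applied to `df` directly.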


---
## 1. Data Understanding <a class="anchor" id="Dataunderstanding"></a>

### 1.2 Describe Data <a class="anchor" id="Describedata"></a>

In [3]:
df.dtypes

unique_identifier     object
usage                float64
status                object
dtype: object

In [4]:
df.columns

Index(['unique_identifier', 'usage', 'status'], dtype='object')

In [5]:
df.size

152658

In [6]:
df.shape

(50886, 3)
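
`df.size` counts every cell, so it should equal the row count times the column count from `df.shape`. Here 50886 × 3 = 152658, which matches the two outputs above. A minimal sketch of that check, using a small hypothetical frame `toy` in place of the real data:

```python
import pandas as pd

# Hypothetical three-column frame standing in for the real data.
toy = pd.DataFrame({
    "unique_identifier": ["a", "b", "c", "d"],
    "usage": [1.0, 2.0, 3.0, 4.0],
    "status": ["ACTIVE", "FINALLED", "ACTIVE", "FINALLED"],
})

n_rows, n_cols = toy.shape
# size is the total number of cells, i.e. rows * columns.
assert toy.size == n_rows * n_cols

# The same identity with the figures reported for the usage dataset.
reported_cells = 50886 * 3
print(reported_cells)
```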

In [7]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50886 entries, 0 to 50885
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   unique_identifier  50886 non-null  object 
 1   usage              50886 non-null  float64
 2   status             50886 non-null  object 
dtypes: float64(1), object(2)
memory usage: 1.2+ MB
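
The `+` in `memory usage: 1.2+ MB` appears because pandas does not measure the contents of `object` columns by default. Passing `memory_usage='deep'` to `info` requests a full measurement, and `isna().sum()` gives a per-column null count that can be cross-checked against the non-null counts above. A sketch on a small hypothetical frame `toy`, not the real data:

```python
import pandas as pd

# Hypothetical frame with the same column layout as the usage dataset.
toy = pd.DataFrame({
    "unique_identifier": ["a1", "b2", "c3"],
    "usage": [10.0, 20.5, 30.0],
    "status": ["ACTIVE", "FINALLED", "ACTIVE"],
})

# Per-column count of missing values; all zeros means no nulls.
null_counts = toy.isna().sum()
print(null_counts)

# Shallow vs. deep memory: deep also measures the strings held in object columns.
shallow_bytes = toy.memory_usage(deep=False).sum()
deep_bytes = toy.memory_usage(deep=True).sum()
print(shallow_bytes, deep_bytes)

# Full report with the deep memory figure instead of a lower bound.
toy.info(memory_usage="deep")
```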


In [8]:
df.describe()

| | usage |
|---|---|
| count | 50886.0 |
| mean | 72705.79 |
| std | 367530.1 |
| min | 0.1 |
| 25% | 9922.0 |
| 50% | 24840.0 |
| 75% | 60286.5 |
| max | 41709170.0 |
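
The summary points to a heavily right-skewed `usage` column: the mean (72705.79) sits above the 75th percentile (60286.5), and the maximum is more than a thousand times the median. One common way to put numbers on this, not necessarily the approach used later in this notebook, is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles reported above; the variable names are illustrative.

```python
# Quartiles and extremes copied from the df.describe() output above.
minimum = 0.1
q1 = 9922.0
median = 24840.0
q3 = 60286.5
maximum = 41709170.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this would be flagged as low outliers
upper_fence = q3 + 1.5 * iqr   # values above this would be flagged as high outliers

print(iqr, lower_fence, upper_fence)

# Which ends of the distribution cross a fence?
has_high_outliers = maximum > upper_fence
has_low_outliers = minimum < lower_fence
print(has_high_outliers, has_low_outliers)
```

Because the lower fence is negative while the smallest recorded usage is 0.1, this rule would only ever flag high values here. With the full frame, the quartiles could be taken from `df['usage'].quantile([0.25, 0.75])` instead of being typed in by hand.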
