<a href="https://colab.research.google.com/github/rc-dbe/bigdatacertification/blob/master/Data_Analytics_Essentials_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


*Hands-on of Big Data Analyst with TuV Certified Qualification*


---




# **1. Data Analytics Essentials**

**Overview:** 

* Data Exploration
* Data Preparation
* Dimensionality Reduction

## **Data Exploration**

Data exploration is the initial step in data analysis, where users explore a large data set in an unstructured way to uncover initial patterns, characteristics, and points of interest. This process isn’t meant to reveal every bit of information a dataset holds, but rather to help create a broad picture of important trends and major points to study in greater detail.

### **Import Data**

There are many ways to import data from an external source. One of them is Pandas, a Python library widely used for data analytics.

In [None]:
# import library
import pandas as pd

In this exercise we will use the Telco Customer Churn dataset, which is available on Kaggle. It contains unique customer records for a telecom company called Telco.

This data set includes information about:
*   Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
*   Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
*   Demographic info about customers – gender, age range, and if they have partners and dependents.
*   Customers who left within the last month – the column is called Churn

We can import many types of files from many types of sources. Since we are using Google Colab, we will import the dataset from an online link (a GitHub repo). Other ways to import data are described in the Pandas documentation [HERE](https://pandas.pydata.org/pandas-docs/stable/).
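As a small offline illustration of how the separator argument works, the sketch below reads semicolon-separated text from memory instead of a URL. The rows are made up for this example and are not taken from the real churn file; only the column names mirror the dataset described above.

```python
import io
import pandas as pd

# Hypothetical sample rows (NOT from the real churn.csv), separated by ';'
raw_text = (
    "customerID;tenure;MonthlyCharges;Churn\n"
    "A-001;1;29.85;No\n"
    "A-002;34;56.95;No\n"
    "A-003;2;53.85;Yes\n"
)

# sep=';' tells read_csv that fields are split on semicolons instead of commas
df_sample = pd.read_csv(io.StringIO(raw_text), sep=';')

print(df_sample.shape)
print(df_sample.dtypes)
```

Without sep=';', Pandas would treat every row as a single column, so the separator has to match the file.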



Import Data

In [None]:
# Import data from an online link (GitHub); the file uses ';' as the field separator
df_link = pd.read_csv('https://raw.githubusercontent.com/rc-dbe/bigdatacertification/master/dataset/churn.csv', sep=';')

In [None]:
# Select the dataset
df = df_link

Show Data

In [None]:
# Print the first 10 rows
df.head(10)

In [None]:
# Print the last 5 rows
df.tail(5)

Show Dataset Info

In [None]:
# Print the number of rows and columns
df.shape

In [None]:
# Prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
df.info()

### **Central Tendency Measurements**

Select Numeric Variables

In [None]:
# Select Only Numeric Variables
df_num = df[['MonthlyCharges', 'tenure', 'TotalCharges']]

Central Tendency Measurements

In [None]:
# Mean value
mean = df_num.mean()
mean

In [None]:
# Median value
median = df_num.median()
median

In [None]:
# Mode value
mode = df_num.mode()
mode

In [None]:
# Prints descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution
df.describe().transpose()
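To see how the three measures differ, here is a minimal sketch on a tiny made-up series (the values are illustrative, not from the churn data). A single large value pulls the mean upward while the median and mode stay put.

```python
import pandas as pd

# Hypothetical charges with one outlier (120)
charges = pd.Series([10, 20, 20, 30, 120])

mean_value = charges.mean()            # (10 + 20 + 20 + 30 + 120) / 5 = 40.0
median_value = charges.median()        # middle value of the sorted data = 20.0
mode_values = charges.mode().tolist()  # most frequent value(s) = [20]

print(mean_value, median_value, mode_values)
```

Note that mode() returns a Series because a dataset can have more than one mode.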

### **Correlation**

To find the pairwise correlation of all columns in a DataFrame, we can use the DataFrame.corr() function from Pandas. Any NA values are automatically excluded. Older Pandas versions also silently ignore non-numeric columns; from Pandas 2.0 onward, pass numeric_only=True to get the same behavior. Read the documentation [HERE](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html).

There are 4 methods available:
*  pearson: standard correlation coefficient
*  kendall: Kendall Tau correlation coefficient
*  spearman: Spearman rank correlation
*  callable: callable with input two 1d ndarrays
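The difference between these methods shows up on data that is monotonic but not linear. The sketch below uses a made-up toy frame where y = x³; the helper abs_pearson is a hypothetical example of the callable option, not part of Pandas.

```python
import numpy as np
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({'x': np.arange(1, 11)})
toy['y'] = toy['x'] ** 3

pearson = toy.corr(method='pearson').loc['x', 'y']    # measures linear association
kendall = toy.corr(method='kendall').loc['x', 'y']    # rank-based (needs SciPy installed)
spearman = toy.corr(method='spearman').loc['x', 'y']  # rank-based

# Hypothetical callable: takes two 1d arrays and returns a single float
def abs_pearson(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

custom = toy.corr(method=abs_pearson).loc['x', 'y']

print(pearson, kendall, spearman, custom)
```

Because the ranks of x and y agree perfectly, the two rank-based coefficients reach 1 while Pearson stays below 1.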

In [None]:
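# Compute the correlation among the numeric columns using the Kendall method
# (numeric_only=True is needed from Pandas 2.0 onward and is available since Pandas 1.5)
correlation_matrix = df.corr(method='kendall', numeric_only=True)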

# Print Correlation Matrix
correlation_matrix

### **Contingency Table**

In [None]:
# Compute Contingency Table
data_crosstab = pd.crosstab(df['PaymentMethod'],
                            df['Churn'],
                            margins=False)
print(data_crosstab)
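A contingency table counts how often each combination of two categorical values occurs. The following sketch, on a small made-up frame (not the real churn data), also shows two optional arguments of pd.crosstab: margins=True adds row and column totals, and normalize='index' turns counts into per-row proportions.

```python
import pandas as pd

# Hypothetical toy records
toy = pd.DataFrame({
    'PaymentMethod': ['Mailed check', 'Mailed check', 'Electronic check',
                      'Electronic check', 'Electronic check', 'Credit card'],
    'Churn': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No'],
})

# Raw counts with an extra 'All' row and column holding the totals
counts = pd.crosstab(toy['PaymentMethod'], toy['Churn'], margins=True)

# Share of churners within each payment method (each row sums to 1)
row_share = pd.crosstab(toy['PaymentMethod'], toy['Churn'], normalize='index')

print(counts)
print(row_share)
```

Row proportions make groups of different sizes comparable, which raw counts do not.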

### **Data Visualization**

Import Libraries

In [None]:
# Import Library
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

plt.rcParams['figure.figsize'] = (7, 7)
plt.style.use('ggplot')

We can customize plots with style sheets. The style package adds support for easy-to-switch plotting “styles” with the same parameters as a matplotlibrc file.

Matplotlib provides a number of pre-defined styles, which can be seen [HERE](https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html).
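As a minimal sketch of switching styles, the snippet below lists the styles installed with Matplotlib and applies one only inside a with block, so the global setting is left unchanged. The plotted numbers are arbitrary placeholders.

```python
import matplotlib.pyplot as plt

# Names of all style sheets available in this Matplotlib installation
available_styles = plt.style.available
print(available_styles[:5])

# Apply 'ggplot' temporarily; the previous style is restored after the block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    ax.set_title('Temporary ggplot style')

plt.close(fig)
```

plt.style.use() changes the style for the rest of the session, while plt.style.context() is convenient for a single figure.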

Visualize Data Distribution

In [None]:
# Visualize Data Distribution (histplot replaces the deprecated distplot)
sns.histplot(df['MonthlyCharges'], kde=True)

Scatter Plot

In [None]:
# Scatter Plot
sns.scatterplot(x='tenure', y='TotalCharges', data= df)

In [None]:
# Scatter Plot
sns.scatterplot(x='tenure', y='TotalCharges', hue='Churn', data= df)

Correlation Map

In [None]:
# Draw Correlation Map (numeric_only=True is needed from Pandas 2.0 onward)
sns.clustermap(df.corr(numeric_only=True), center=0, cmap='vlag', linewidths=.75)

Pairwise Relationship

In [None]:
# Pairwise relationships
sns.pairplot(df, hue='Churn')

## **Data Preparation**

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis.

### **Handling Missing Values**

In [None]:
# Show missing values on data
df.isnull().sum()

In [None]:
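# Compute the Median Value
median = df['TotalCharges'].median()

# Use Median to Replace Missing Values
# (assign the result back; inplace=True on a selected column is not reliable across Pandas versions)
df['TotalCharges'] = df['TotalCharges'].fillna(median)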

# Check for Missing Values
df.info()
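One reason to impute with the median is that it is less sensitive to extreme values than the mean. The sketch below illustrates this on a made-up column with one missing entry and one outlier; the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column: one missing value and one very large value
toy = pd.DataFrame({'TotalCharges': [100.0, 120.0, np.nan, 110.0, 5000.0]})

n_missing_before = toy['TotalCharges'].isnull().sum()

# Both statistics skip NaN by default
median_value = toy['TotalCharges'].median()  # median of 100, 110, 120, 5000 = 115.0
mean_value = toy['TotalCharges'].mean()      # (100 + 110 + 120 + 5000) / 4 = 1332.5

# Fill the gap with the median and assign the result back to the column
toy['TotalCharges'] = toy['TotalCharges'].fillna(median_value)
n_missing_after = toy['TotalCharges'].isnull().sum()

print(median_value, mean_value, n_missing_before, n_missing_after)
```

Here the mean would insert a value far from most observations, while the median stays close to the typical range.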

### **Encode Categorical Value**

In [None]:
# Import Module
from sklearn.preprocessing import OneHotEncoder

# Columns to Encode
categorical_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'InternetService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

# Encoder (scikit-learn >= 1.2; older versions use sparse=False instead of sparse_output=False)
encoder = OneHotEncoder(sparse_output=False)

# Encode Categorical Data (get_feature_names_out replaces the removed get_feature_names)
df2 = pd.DataFrame(encoder.fit_transform(df[categorical_cols]),
                   columns=encoder.get_feature_names_out(categorical_cols),
                   index=df.index)

# Replace Categorical Data with Encoded Data
df_encoded = pd.concat([df.drop(columns=categorical_cols), df2], axis=1)

# Show Encoded Dataframe
df_encoded
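To make the effect of one-hot encoding concrete, the sketch below encodes a single categorical column of a made-up frame. It uses pd.get_dummies, a Pandas alternative to OneHotEncoder that is swapped in here only because it keeps the example short; the data is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'MonthlyCharges': [29.85, 56.95, 42.30, 70.70],
})

# Each category becomes its own 0/1 column named <column>_<category>
toy_encoded = pd.get_dummies(toy, columns=['Contract'], dtype=int)

print(toy_encoded)
```

Every row has exactly one 1 among the new Contract columns, marking the category that row belonged to. OneHotEncoder is usually preferred in modeling pipelines because it remembers the categories seen during fitting.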

In [None]:
# Replace Churn Values ('No' -> 0, 'Yes' -> 1) and assign the result back
df_encoded['Churn'] = df_encoded['Churn'].map({'No': 0, 'Yes': 1})

# Show Data
df_encoded

In [None]:
# Show DataFrame Info
df_encoded.info()

In [None]:
# Drop Unwanted Column
df_encoded = df_encoded.drop("customerID", axis=1)
df_encoded.head()

### **Normalization**

Normalization typically means rescaling the values into the range [0, 1].
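Min-max normalization maps each value x to (x - min) / (max - min), computed per column. The following sketch, on made-up numbers, checks that this hand-written formula agrees with scikit-learn's MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data
values = np.array([[10.0], [20.0], [40.0], [50.0]])

# Manual min-max scaling, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# Same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(scaled.ravel())
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between.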

In [None]:
# Select Columns
column_names = df_encoded.columns.tolist()
column_names.remove('Churn')
column_names

In [None]:
# Import MinMax Scaler
from sklearn.preprocessing import MinMaxScaler

# initialize min-max scaler
mm_scaler = MinMaxScaler()
df_norm = df_encoded.copy()

# Transform all attributes
df_norm[column_names] = mm_scaler.fit_transform(df_norm[column_names])
df_norm.sort_index(inplace=True)
df_norm.head()

### **Standardization**

Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
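Standardization replaces each value x with (x - mean) / std, computed per column. The sketch below, on made-up numbers, compares the manual formula with StandardScaler, which uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data with mean 5 and population std 2
values = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# Same transformation with scikit-learn
scaled = StandardScaler().fit_transform(values)

print(manual.ravel())
print(scaled.mean(), scaled.std())
```

Unlike min-max scaling, the result is not bounded to a fixed range, so outliers remain visible as large absolute z-scores.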

In [None]:
# Import Standard Scaler
from sklearn.preprocessing import StandardScaler

# Initialize Standard Scaler
standard_scaler = StandardScaler()
df_stand = df_encoded.copy()

# Transform all attributes
df_stand[column_names] = standard_scaler.fit_transform(df_stand[column_names])
df_stand.sort_index(inplace=True)
df_stand.head()

## **Dimensionality Reduction**

### **Principal Component Analysis**

Here we apply Principal Component Analysis (PCA) to the breast cancer data (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data). Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Attribute Information:

1.   ID number
2.   Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

1.   radius (mean of distances from center to points on the perimeter)
2.   texture (standard deviation of gray-scale values)
3.   perimeter
4.   area
5.   smoothness (local variation in radius lengths)
6.   compactness (perimeter^2 / area - 1.0)
7.   concavity (severity of concave portions of the contour)
8.   concave points (number of concave portions of the contour)
9.   symmetry
10.   fractal dimension ("coastline approximation" - 1)

Import Data

In [None]:
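# Here We Are Using the Built-in Dataset of Scikit-Learn
# (569 samples; 30 features: mean, standard error and worst value of each of the ten measurements)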
from sklearn.datasets import load_breast_cancer 
  
# Instantiating 
cancer = load_breast_cancer() 

# Creating Dataframe 
df_pca = pd.DataFrame(cancer['data'], columns = cancer['feature_names']) 
df_pca

Preprocess the Data

In [None]:
# Importing StandardScaler Module
from sklearn.preprocessing import StandardScaler 

# Set Name for StandardScaler as scaler
scaler = StandardScaler() 

# Fit Standardization
scaler.fit(df_pca) 

# Transformed Data (keep the original feature names as column labels)
df_pca_scaled = pd.DataFrame(scaler.transform(df_pca), columns=df_pca.columns)

# Checking Data
df_pca_scaled

**Modeling PCA (Principal Components = 2)**

In [None]:
# Importing PCA Module
from sklearn.decomposition import PCA 

# Modeling PCA with Components = 2 
pca2 = PCA(n_components = 2) 

# Apply Model to Data
pca2.fit(df_pca_scaled) 

# Show Result
x_pca2 = pca2.transform(df_pca_scaled) 
df_pca2_result = pd.DataFrame(x_pca2)
df_pca2_result
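A natural follow-up question is how much of the original variance the two components retain. The sketch below rebuilds the same steps in a self-contained way and reads the fitted explained_variance_ratio_ attribute; the variable names are chosen for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 30 features, as in the preprocessing step
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit a 2-component PCA
pca_sketch = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each component, in descending order
ratios = pca_sketch.explained_variance_ratio_
total_retained = ratios.sum()

print(ratios)
print(total_retained)
```

A total well below 1 means the 2D plot is a simplification: some structure in the data is not visible in it.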

Visualize PCA

In [None]:
# Plot 
plt.scatter(x_pca2[:, 0], x_pca2[:, 1], c = cancer['target'], cmap ='plasma') 

# Add Title and Axis Labels
plt.title('Principal Component Analysis')
plt.xlabel('First Principal Component') 
plt.ylabel('Second Principal Component')

In [None]:
# Plotting Heatmap
df_comp2 = pd.DataFrame(pca2.components_, columns = cancer['feature_names']) 
sns.heatmap(df_comp2)
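The heatmap shows the component loadings, that is, how strongly each original feature contributes to each principal component. As a numeric complement, this self-contained sketch lists the features with the largest absolute loading on the first component; the names loadings and top_pc1 are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer_data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer_data.data)
pca_sketch = PCA(n_components=2).fit(X_scaled)

# One row per component, one column per original feature
loadings = pd.DataFrame(pca_sketch.components_,
                        columns=cancer_data.feature_names,
                        index=['PC1', 'PC2'])

# Features with the largest absolute weight on the first component
top_pc1 = loadings.loc['PC1'].abs().sort_values(ascending=False).head(5)

# Each component is a unit-length direction in feature space
component_norms = np.linalg.norm(pca_sketch.components_, axis=1)

print(top_pc1)
```

The sign of a loading depends on the arbitrary orientation of the component, so the absolute value is the safer quantity to rank by.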

**Modeling PCA (Principal Components = 3)**

In [None]:
# Importing PCA Module
from sklearn.decomposition import PCA 

# Modeling PCA with Components = 3
pca3 = PCA(n_components = 3) 

# Apply Model to Data
pca3.fit(df_pca_scaled) 

# Show Result
x_pca3 = pca3.transform(df_pca_scaled) 
df_pca3_result = pd.DataFrame(x_pca3)
df_pca3_result
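Instead of fixing the number of components by hand, scikit-learn can choose it from a variance target: passing a float between 0 and 1 as n_components keeps just enough components to explain that share of variance. The 0.95 threshold below is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# A float n_components is a variance target; svd_solver='full' is set explicitly for it
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_scaled)

n_kept = pca_95.n_components_
cumulative = np.cumsum(pca_95.explained_variance_ratio_)

print(n_kept)
print(cumulative)
```

The last entry of the cumulative sum shows how much variance the selected components retain in total.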

Visualize PCA

In [None]:
# Import Axes3D Module
from mpl_toolkits.mplot3d import Axes3D

# Plot 
ax = plt.axes(projection='3d')
ax.scatter(x_pca3[:, 0], x_pca3[:, 1], x_pca3[:, 2], c = cancer['target'], cmap ='plasma') 

# Add Title and Axis Labels
plt.title('Principal Component Analysis')
ax.set_xlabel('First Principal Component') 
ax.set_ylabel('Second Principal Component')
ax.set_zlabel('Third Principal Component')

In [None]:
# Plotting Heatmap
df_comp3 = pd.DataFrame(pca3.components_, columns = cancer['feature_names']) 
sns.heatmap(df_comp3)