<a href="https://colab.research.google.com/github/rc-dbe/bigdatacertification/blob/master/Data_Analytics_Essentials_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


*Hands-on of Big Data Analyst with TuV Certified Qualification*


---




# 1. Data Analytics Essentials

**Overview:**

* Data Exploration
* Data Preparation
* Dimensionality Reduction

## Data Exploration

Data exploration is the initial step in data analysis, where users explore a large data set in an unstructured way to uncover initial patterns, characteristics, and points of interest. This process isn’t meant to reveal every bit of information a dataset holds, but rather to help create a broad picture of important trends and major points to study in greater detail.

### Import Data

There are many ways to import data from an external source. One of them is pandas, a Python library that is widely used for data analytics.

In [0]:
# import library
import pandas as pd

In this exercise we will use the Telco Customer Churn dataset, which is available on Kaggle. This dataset contains unique customer records for a telecom company called Telco.

This data set includes information about:
*   Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
*   Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
*   Demographic info about customers – gender, age range, and if they have partners and dependents.
*   Customers who left within the last month – the column is called Churn

We can import many types of files from many types of sources. Since we are using Google Colab, we are going to import the dataset from an online link (a GitHub repository). Other methods for importing data are described in the pandas documentation [HERE](https://pandas.pydata.org/pandas-docs/stable/).



In [0]:
# Import data from an online link (GitHub); the file is semicolon-separated
df_link = pd.read_csv('https://raw.githubusercontent.com/rc-dbe/bigdatacertification/master/dataset/churn.csv', sep=';')
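
The role of the `sep` argument can be seen without downloading anything. The sketch below reads a made-up three-row sample (the values are hypothetical, not taken from the real file) from an in-memory string that uses the same semicolon-separated layout.

```python
import io
import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as the dataset
raw = (
    "customerID;tenure;MonthlyCharges\n"
    "A-1;1;29.85\n"
    "B-2;34;56.95\n"
    "C-3;2;53.85\n"
)

# With the default sep=',' every line would end up in one single column,
# so the delimiter has to be passed explicitly.
sample = pd.read_csv(io.StringIO(raw), sep=';')

print(sample.shape)
print(sample.dtypes)
```

Checking `shape` and `dtypes` right after loading is a quick way to notice a wrong delimiter: a single column of type `object` usually means the separator did not match.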

In [0]:
# Select one of the datasets to work with
df = df_link

In [0]:
# Show the number of rows and columns
df.shape

In [0]:
# Prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
df.info()

In [0]:
# Show the first 10 rows
df.head(10)

In [0]:
# Show the last 5 rows
df.tail(5)

### Measure of Central Tendency

In [0]:
# Select only the numeric variables
df_num = df[['MonthlyCharges', 'tenure', 'TotalCharges']]

In [0]:
# Mean value
mean = df_num.mean()
mean

In [0]:
# Median value
median = df_num.median()
median

In [0]:
# Mode value
mode = df_num.mode()
mode

In [0]:
# Prints descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution
df.describe().transpose()
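
To see how the three measures differ, here is a small sketch on a made-up series (the numbers are hypothetical) that contains one large outlier. The mean is pulled towards the outlier, while the median and mode are not.

```python
import pandas as pd

# Hypothetical monthly charges with one large outlier
charges = pd.Series([20, 20, 25, 30, 105], name="charges")

print(charges.mean())           # 40.0
print(charges.median())         # 25.0
print(charges.mode().tolist())  # [20]
```

Note that `mode()` returns a Series rather than a single number, because several values can be tied for most frequent.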

### Correlation

To find the pairwise correlation of all columns in the data frame, we can use the `DataFrame.corr()` function from pandas. NA values are excluded automatically. Older pandas versions silently ignored non-numeric columns; since pandas 2.0 they raise an error unless only numeric columns are passed (or `numeric_only=True` is set). Read the documentation [HERE](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html).

There are 4 methods available:
*  pearson: standard correlation coefficient
*  kendall: Kendall Tau correlation coefficient
*  spearman: Spearman rank correlation
*  callable: callable with input two 1d ndarrays

In [0]:
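# Compute the correlation among the numeric columns using the Kendall method
# (non-numeric columns are dropped first, which works across pandas versions)
correlation_matrix = df.select_dtypes(include='number').corr(method='kendall')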

# Print Correlation Matrix
correlation_matrix
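
As a small illustration of how the choice of `method` changes the result, the sketch below uses a made-up data frame in which `y` grows monotonically but not linearly with `x`. It compares only `pearson` and `spearman`; the `label` column is a hypothetical non-numeric column that is filtered out before calling `corr()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],   # monotonic in x, but not linear
    "label": list("abcde"),   # non-numeric column
})

# Keep numeric columns only before computing correlations
numeric = toy.select_dtypes(include="number")

pearson = numeric.corr(method="pearson").loc["x", "y"]
spearman = numeric.corr(method="spearman").loc["x", "y"]

print(round(pearson, 3), round(spearman, 3))
```

The rank-based Spearman coefficient reaches 1 because the ordering of `x` and `y` is identical, while the Pearson coefficient stays below 1 because the relationship is not a straight line.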

### Contingency Table

In [0]:
# Compute the contingency table of payment method versus churn
data_crosstab = pd.crosstab(df['PaymentMethod'],
                            df['Churn'],
                            margins=False)
print(data_crosstab)
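
Two `pd.crosstab` options are often useful when reading a contingency table: `margins=True` adds row and column totals, and `normalize="index"` turns counts into row-wise shares. The sketch below shows both on a made-up six-row frame (the categories and values are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Monthly", "Monthly", "Monthly", "Yearly", "Yearly", "Yearly"],
    "Churn":    ["Yes", "Yes", "No", "No", "No", "Yes"],
})

# Raw counts with an extra "All" row and column holding the totals
counts = pd.crosstab(toy["Contract"], toy["Churn"], margins=True)

# Share of each Churn value within every Contract row (rows sum to 1)
shares = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(counts)
print(shares)
```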

### Data Visualization

In [0]:
# Import Library
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

plt.rcParams['figure.figsize'] = (7, 7)
plt.style.use('ggplot')

We can customize plots with style sheets. The style package adds support for easy-to-switch plotting “styles” with the same parameters as a matplotlibrc file.

There are a number of pre-defined styles provided by matplotlib, that can be seen [HERE](https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html)
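
A style can also be applied to a single figure instead of the whole session. The following is a minimal sketch: `plt.style.available` lists the style names shipped with the installed matplotlib, and `plt.style.context` applies a style only inside the `with` block.

```python
import matplotlib.pyplot as plt

# Names of the styles available in this matplotlib installation
print(sorted(plt.style.available)[:5])

# Apply 'ggplot' only to the figure created inside this block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot([0, 1, 2], [0, 1, 4])
    ax.set_title("ggplot style, this figure only")
```

Figures created after the block fall back to whatever style was active before it.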

#### Visualize Data Distribution

In [0]:
# Visualize Data Distribution
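# histplot with kde=True replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df['MonthlyCharges'], kde=True)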

#### Scatter Plot

In [0]:
# Scatter Plot
sns.scatterplot(x="tenure", y="TotalCharges", data= df)

#### Bar Chart

In [0]:
# Bar Chart
sns.barplot(x='Churn',y='tenure',data=df)

#### Pairwise Relationship

In [0]:
# Pairwise relationships
sns.pairplot(df)

## Data Preparation

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis.

### Handling Missing Values

In [0]:
# Show missing values on data
df.isnull().sum()

In [0]:
# Impute missing values with the median
tc_median = df["TotalCharges"].median()
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
df["TotalCharges"] = df["TotalCharges"].fillna(tc_median)

In [0]:
# Show the data information to check that the missing values have been replaced
df.info()
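
The same count, impute, and verify steps can be followed on a tiny made-up column (the values are hypothetical), which makes it easy to check the result by hand.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"TotalCharges": [10.0, np.nan, 30.0, 200.0, np.nan]})

# Count the missing entries before imputing
n_missing = toy["TotalCharges"].isnull().sum()

# median() skips NaN, so it is computed from 10, 30 and 200
fill_value = toy["TotalCharges"].median()

# Assign the filled column back to the frame
toy["TotalCharges"] = toy["TotalCharges"].fillna(fill_value)

print(n_missing, fill_value)
print(toy)
```

The median is used here rather than the mean because it is less sensitive to extreme values such as the 200 in this sample.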

### Encoding Categorical Variable

In [0]:
# Install Category Encoders
! pip install category_encoders

In [0]:
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['gender', 
          'SeniorCitizen', 
          'Partner', 
          'Dependents', 
          'PhoneService', 
          'MultipleLines', 
          'InternetService', 
          'OnlineSecurity', 
          'OnlineBackup', 
          'DeviceProtection', 
          'TechSupport', 
          'StreamingTV', 
          'StreamingMovies', 
          'Contract', 
          'PaperlessBilling', 
          'PaymentMethod'])
df_binary = encoder.fit_transform(df)

df_binary.head()

In [0]:
df_binary.info()

In [0]:
df = df_binary.drop("customerID", axis=1)
df.head()
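
The idea behind binary encoding can be sketched by hand with pandas alone: each category first gets an integer code, and that code is then written in binary, one column per bit. This is only an illustration of the concept on a made-up column. The column names and code ordering are assumptions and are not guaranteed to match the exact output of `category_encoders`.

```python
import pandas as pd

# Hypothetical categorical column
contract = pd.Series(["Month-to-month", "One year", "Two year", "One year"])

# Step 1: integer codes in order of first appearance, starting at 1
codes = pd.factorize(contract)[0] + 1

# Step 2: write each code in binary, most significant bit first
n_bits = int(codes.max()).bit_length()
bits = pd.DataFrame(
    {f"Contract_{i}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)

print(bits)
```

Three categories fit into two bit columns here, which is why binary encoding produces fewer columns than one-hot encoding when a variable has many categories.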

### Normalization

Normalization typically means rescaling the values of each feature into the range [0, 1].

In [0]:
column_names = df.columns.tolist()
column_names.remove('Churn')
column_names

In [0]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize the min-max scaler
mm_scaler = MinMaxScaler()
df_norm = df.copy()

# Transform all attributes
df_norm[column_names] = mm_scaler.fit_transform(df_norm[column_names])
df_norm.sort_index(inplace=True)
df_norm.head()
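
Min-max scaling applies the formula x' = (x - min) / (max - min) to each column. A minimal sketch with NumPy on made-up values (hypothetical tenures) shows the computation for one column.

```python
import numpy as np

# Hypothetical tenure values in months
tenure = np.array([1.0, 12.0, 24.0, 72.0])

# x' = (x - min) / (max - min)
scaled = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(scaled)
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses all the others towards one end of the range.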

### Standardization

Standardization typically means rescaling the data so that each feature has a mean of 0 and a standard deviation of 1 (unit variance).

In [0]:
# Import Standard Scaler
from sklearn.preprocessing import StandardScaler

# Initialize the standard scaler
standard_scaler = StandardScaler()
df_stand = df.copy()

# Transform all attributes
df_stand[column_names] = standard_scaler.fit_transform(df_stand[column_names])
df_stand.sort_index(inplace=True)
df_stand.head()
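
Standardization computes z = (x - mean) / std for each column. The sketch below does this by hand with NumPy on made-up numbers. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; pandas' `Series.std()` defaults to `ddof=1`, so it gives slightly different values.

```python
import numpy as np

# Hypothetical feature values
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# np.std uses ddof=0 (population standard deviation) by default
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```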

## Dimensionality Reduction

### Principal Component Analysis

In [0]:
# Here we use a built-in dataset from scikit-learn
from sklearn.datasets import load_breast_cancer

# Load the dataset
cancer = load_breast_cancer()

# Create a DataFrame from the feature matrix
df_pca = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])

# Check the first rows of the DataFrame
df_pca.head()

In [0]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler, then standardize the features
scaler.fit(df_pca)
scaled_data = scaler.transform(df_pca)

# Import PCA
from sklearn.decomposition import PCA

# Keep the first 2 principal components
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)

# Shape of the projected data: (n_samples, 2)
x_pca.shape


In [0]:
# Plot 
plt.scatter(x_pca[:, 0], x_pca[:, 1], c = cancer['target'], cmap ='plasma') 

# labeling x and y axes 
plt.xlabel('First Principal Component') 
plt.ylabel('Second Principal Component') 


In [0]:
# plotting heatmap
df_comp = pd.DataFrame(pca.components_, columns = cancer['feature_names']) 
sns.heatmap(df_comp) 
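
A natural follow-up question is how much of the original variance the two components keep. The sketch below is self-contained and repeats the scaling and fitting steps under a separate, hypothetical name `pca_2`, then reads the fitted model's `explained_variance_ratio_` attribute.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit a 2-component PCA
X = StandardScaler().fit_transform(load_breast_cancer().data)
pca_2 = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each component
ratio = pca_2.explained_variance_ratio_

print(ratio)
print(ratio.sum())
```

If the sum is low for a given task, increasing `n_components` is the usual next step; passing a float between 0 and 1 as `n_components` asks scikit-learn to keep enough components to reach that share of variance.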