<a href="https://colab.research.google.com/github/rc-dbe/bigdatacertification/blob/master/Data_Mining_Model_Structured_Data_(Part_1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Hands-on of Big Data Analyst with TuV Certified Qualification*


---



# 2. Data Mining Model - Structured Data (Part 1)

Sub topics covered in this practice:
* Regression

## Regression
Regression analysis is a basic statistical method for estimating the relationships among variables. It requires identifying a dependent variable whose value varies with the value of one or more independent variables.

### Linear Regression
Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). [Wikipedia](https://en.wikipedia.org/wiki/Linear_regression)
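
As a minimal sketch of what "fitting a line" means with a single explanatory variable, the slope and intercept can be computed directly from the ordinary least squares formulas. The toy arrays `x` and `y` below are hypothetical values chosen for illustration and are not part of the salary dataset used in this practice.

```python
import numpy as np

# Hypothetical toy data (not the salary dataset): y = 2x + 1 exactly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Ordinary least squares for one explanatory variable:
# slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print("slope =", slope)
print("intercept =", intercept)
```

Because the toy data lie exactly on a line, the recovered slope and intercept match the values used to generate them. On real data the same formulas give the line that minimizes the sum of squared residuals.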

In [0]:
# Import library
import pandas as pd

In [0]:
# Import Dataset
salary_df = pd.read_csv('https://raw.githubusercontent.com/rc-dbe/bigdatacertification/master/dataset/Salary_Data.csv')

In [0]:
# Prints the Dataset Information
salary_df.info()

In [0]:
# Print the first 30 rows
salary_df.head(30)

In [0]:
# Prints descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution
salary_df.describe().transpose()

In [0]:
# Import Library to Visualize the Data
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


plt.rcParams['figure.figsize'] = (7, 7)
plt.style.use('ggplot')

# Show the relationship between experience and salary as a scatter plot
sns.scatterplot(x="YearsExperience", y="Salary", data=salary_df)


In [0]:
# Select the X and Y variables:
# X holds every column except the last one, Y holds the second column (Salary)
X = salary_df.iloc[:, :-1].values
Y = salary_df.iloc[:, 1].values

In [0]:
# Modelling
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X, Y)

In [0]:
# Show the Coefficient and Intercept
print('Coefficient =', lr.coef_)
print('Intercept =', lr.intercept_)

In [0]:
# Plot the data points and the fitted regression line
plt.scatter(X, Y)
plt.plot(X, lr.predict(X), color="green")
plt.title("Salary vs Experience")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

### Logistic Regression
Logistic regression is a statistical method for predicting binary classes. The outcome, or target variable, is dichotomous, meaning it has only two possible classes. The model estimates the probability that an event occurs and can be used, for example, in cancer detection problems.
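
As a minimal sketch of how a probability comes out of the model, logistic regression passes a linear score through the sigmoid function and then applies a threshold to obtain a class. The coefficient `w`, intercept `b`, and inputs `x` below are hypothetical values chosen for illustration, not values fitted on the churn data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficient, intercept, and feature values (illustration only)
w, b = 1.5, -3.0
x = np.array([0.0, 2.0, 4.0])

prob = sigmoid(w * x + b)          # estimated P(class = 1 | x)
label = (prob >= 0.5).astype(int)  # threshold at 0.5 to get a class label

print(prob)
print(label)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class when the threshold is 0.5.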

In this practice, we will build a Logistic Regression model to predict whether a customer will churn (also known as customer attrition). The dataset is a telco churn dataset from Kaggle that has already been preprocessed. It includes information about:
*   Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
*   Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
*   Demographic info about customers – gender, age range, and if they have partners and dependents.
*   Customers who left within the last month – the column is called Churn.


In [0]:
# Import Library
import pandas as pd

# Load the dataset from the URL
url = 'https://raw.githubusercontent.com/rc-dbe/bigdatacertification/master/dataset/churn_trasnsformed_new.csv'
df_csv = pd.read_csv(url, sep=',')

# Show the first 5 rows
df_csv.head()

In [0]:
# Remove the "Unnamed: 0" column
df = df_csv.drop("Unnamed: 0", axis=1)
df.head()

In [0]:
# Check the Data Information
df.info()

In [0]:
# Select the features by dropping the target column and the unused column
feature = ['Churn', 'TotalCharges']
train_feature = df.drop(feature, axis=1)

# Set the target
train_target = df["Churn"]

In [0]:
# Show the Feature
train_feature.head(5)

In [0]:
# Split the data into 70% training and 30% test sets
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_target, shuffle=True, test_size=0.3, random_state=1)

In [0]:
# Show the training data
X_train.head()

In [0]:
# Import the library
from sklearn.linear_model import LogisticRegression

# Train the Logistic Regression model using the default parameters
logreg = LogisticRegression()
logreg_model = logreg.fit(X_train, y_train)

# Predict the classes for X_test
y_predlg = logreg.predict(X_test)

In [0]:
# Import the metrics module
from sklearn import metrics

# Compute the confusion matrix (rows: actual classes, columns: predicted classes)
cnf_matrixlg = metrics.confusion_matrix(y_test, y_predlg)
cnf_matrixlg

In [0]:
# Show the Accuracy, Precision, Recall, F1 Score, and Cohen's Kappa Score
acc_lg = metrics.accuracy_score(y_test, y_predlg)
prec_lg = metrics.precision_score(y_test, y_predlg)
rec_lg = metrics.recall_score(y_test, y_predlg)
f1_lg = metrics.f1_score(y_test, y_predlg)
kappa_lg = metrics.cohen_kappa_score(y_test, y_predlg)

print("Accuracy:", acc_lg)
print("Precision:", prec_lg)
print("Recall:", rec_lg)
print("F1 Score:", f1_lg)
print("Cohen's Kappa Score:", kappa_lg)
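
To make the link between the confusion matrix and the scores above explicit, the sketch below recomputes accuracy, precision, recall, and F1 by hand from the four confusion-matrix counts. The label lists `y_true` and `y_pred` are hypothetical values for illustration only, not predictions from the churn model.

```python
from sklearn import metrics

# Hypothetical binary labels (illustration only, not the churn data)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

These hand-computed values should agree with `metrics.accuracy_score`, `metrics.precision_score`, `metrics.recall_score`, and `metrics.f1_score` on the same labels.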
