# Title

## Imports and acquiring data

### Import required Python modules
- `numpy` to work with arrays  
- `pandas` to manipulate and analyze data  
- `matplotlib` to visualize data  
- `seaborn` to visualize data  
- `sklearn` (scikit-learn) machine learning library
    - `StandardScaler`
    - `train_test_split`
    - `LinearRegression`
    - `LogisticRegression`
    - `KNeighborsClassifier`
    - `classification_report` 
    - `confusion_matrix`


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer, load_iris # load ... other datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

import nltk # for nlp
#nltk.download_shell()
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from IPython.display import Image  
from six import StringIO  


%matplotlib inline

### Read in data from csv file


In [None]:
data = pd.read_csv('my.csv') # might need index_col=0 if the file already contains an index column

### Read in data from Excel file

In [None]:
data = pd.read_excel('my.xlsx') # might need index_col=0 if the file already contains an index column

### View data head

In [None]:
data.head()

### List the column labels of the data

In [None]:
data.columns

### Display the info about the data

In [None]:
data.info()

### Describe the data
Descriptive statistics summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

`describe()` analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output varies depending on what is provided.

In [None]:
data.describe()
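
The difference between numeric and object columns can be sketched on a small frame. The toy data and the name `toy` are made up for illustration; `include="all"` is the pandas option that adds the object-column statistics.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({"age": [22, 35, 58], "sex": ["male", "female", "female"]})

# By default only numeric columns are summarized (count, mean, std, quartiles, ...)
numeric_summary = toy.describe()

# include="all" also reports count/unique/top/freq for the object column
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column type show up as NaN in the combined output.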

## Prepare Data

### Remove all rows that contain NaN

In [None]:
data.dropna(inplace=True, axis=0)

### Use the map function to change any 'yes' values to 1 and 'no' values to 0


In [None]:
data['col'] = data['col'].map({'yes':1, 'no':0})
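
One thing worth checking after `map`: values that are not keys of the dictionary become NaN. A minimal sketch with a made-up series (`answers` is a hypothetical stand-in for `data['col']`):

```python
import pandas as pd

# Hypothetical column with inconsistent capitalization
answers = pd.Series(["yes", "no", "Yes", "no"])

# 'Yes' is not a key of the mapping, so it turns into NaN
encoded = answers.map({"yes": 1, "no": 0})
print(encoded.isna().sum())

# Normalizing the text first avoids the unexpected NaN
normalized = answers.str.lower().map({"yes": 1, "no": 0})
print(normalized.tolist())
```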

### Convert categorical content
A categorical column, e.g. `Sex` with the values `male` or `female`, has to be transformed into numeric values before a model can work with it. The first dummy column can be dropped because its information is redundant.  
The new dummy columns are then concatenated with the data.  
`axis=1` means the columns are joined side by side (along the column axis).

In [None]:
# Create dummy variables to include the categorical data in the regression
# Use the convenient method 'get_dummies' to create the dummy variables
# It is extremely important to drop one of the dummies
# Otherwise we will introduce multicollinearity
# drop_first drops the first category (here 'female')
sex = pd.get_dummies(data['Sex'], drop_first=True)
data = pd.concat([data, sex], axis=1)
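
The effect of `drop_first` is easiest to see on a tiny frame. This is a sketch with made-up data; `people` is a hypothetical name, and `dtype=int` is passed only so the dummies print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data
people = pd.DataFrame({"Sex": ["male", "female", "female"], "Age": [22, 38, 26]})

# Without drop_first there is one column per category
all_dummies = pd.get_dummies(people["Sex"], dtype=int)
print(list(all_dummies.columns))

# With drop_first the first category (alphabetically 'female') is dropped
sex = pd.get_dummies(people["Sex"], drop_first=True, dtype=int)

# Replace the text column by the remaining dummy column
people = pd.concat([people.drop("Sex", axis=1), sex], axis=1)
print(people)
```

The remaining `male` column carries the full information: 1 means male, 0 means female.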


In [None]:
# use cat_feats to list the columns that should be converted
cat_feats = ['purpose']
data = pd.get_dummies(data, columns=cat_feats, drop_first=True)

### Remove Columns that are not required

In [None]:
data.drop(['col1', 'col2', 'col3'], axis=1, inplace=True)

### Check for missing values

In [None]:
# data.isnull() returns a DataFrame of booleans
# that tells whether each data point is null:
# True = the data point is missing
# False = the data point is not missing
# .sum() then counts the missing values per column
data.isnull().sum()

### Preprocessing NLP

In [None]:
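import string  # provides string.punctuation

# remove punctuation and stopwords
def text_process(msg):
    # drop punctuation characters, then rejoin into one string
    nopunc = [char for char in msg if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # keep only the words that are not English stopwords
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
data['message'].apply(text_process)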

bow_trans = CountVectorizer(analyzer=text_process).fit(data['message'])
msg_bow = bow_trans.transform(data['message'])
print(msg_bow.shape)
print(msg_bow.nnz)
sparsity = (100.0 * msg_bow.nnz / (msg_bow.shape[0] * msg_bow.shape[1]))
print(sparsity)
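
To see what the bag-of-words step produces, here is a self-contained sketch. It makes two assumptions: a tiny hand-written stopword set replaces `stopwords.words('english')` so that no NLTK download is needed, and the two messages are invented. Note that the printed percentage is the share of non-zero entries in the matrix.

```python
import string
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for stopwords.words('english')
STOPWORDS = {"this", "is", "a", "only"}

def text_process(msg):
    # strip punctuation, then drop stopwords (case-insensitive)
    nopunc = "".join(ch for ch in msg if ch not in string.punctuation)
    return [w for w in nopunc.split() if w.lower() not in STOPWORDS]

messages = ["Hello, this is a test!", "This test is only a test."]

# A callable analyzer replaces the built-in tokenization entirely
bow = CountVectorizer(analyzer=text_process).fit(messages)
counts = bow.transform(messages)

print(bow.vocabulary_)   # token -> column index
print(counts.toarray())  # one row per message, one column per token
nonzero_share = 100.0 * counts.nnz / (counts.shape[0] * counts.shape[1])
print(nonzero_share)
```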

In [None]:
p = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
p.fit(X_train, y_train)
pred = p.predict(X_test)
print(classification_report(y_test, pred))  # true labels first, then predictions

## Multicollinearity
sklearn does not have a built-in way to check for multicollinearity, so use the relevant module in statsmodels.  
Documentation: http://www.statsmodels.org/dev/_modules/statsmodels/stats/outliers_influence.html#variance_inflation_factor

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Declare a data frame and put in all features we want to check for multicollinearity
# Since categorical data is not preprocessed, only take the numerical ones
variables = data_cleaned[['Mileage','Year','EngineV']]

# Create a new data frame which includes all VIFs
# Each variable has its own variance inflation factor. This measure is variable specific
vif = pd.DataFrame()

# Make use of the variance_inflation_factor function, output the respective VIFs
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]

# Include variable names so it is easier to explore the result
vif["Features"] = variables.columns

print(vif)
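
For intuition, the variance inflation factor of feature $i$ is $1 / (1 - R_i^2)$, where $R_i^2$ comes from regressing that feature on all the others. The sketch below computes it by hand with scikit-learn on synthetic data. The helper `vif_by_hand` and the toy columns are hypothetical. This version fits an intercept, whereas `variance_inflation_factor` uses the columns exactly as passed, so the numbers may differ from the statsmodels output unless a constant column is included there.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: 'c' is almost a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "c": a + 0.1 * rng.normal(size=200)})

def vif_by_hand(frame):
    """Hypothetical helper: VIF_i = 1 / (1 - R_i^2) for every column."""
    out = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        # R^2 of predicting this column from all remaining columns
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")

vif_toy = vif_by_hand(toy)
print(vif_toy)
```

The nearly duplicated columns `a` and `c` get large values, while the independent column `b` stays close to 1.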

## Scale data if required

### Create a standard scaler
Import the scaling module and create a scaler object  
Fit the inputs (calculate the mean and standard deviation feature-wise)  
Scale the features and store them in a new variable (the actual scaling procedure)   


In [None]:
scaler = StandardScaler()

### Fit the features without the label to the scaler

In [None]:
scaler.fit(data.drop('label col', axis=1))

### Transform the data into a scaled dataset

In [None]:
scaled_features = scaler.transform(data.drop('label col', axis=1))

### Create a dataset of the scaled features
Make sure the columns do not contain the label.

In [None]:
data_features = pd.DataFrame(scaled_features, columns=data.drop('label col', axis=1).columns)
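
The scaling steps can be checked on a small example. The frame `toy` and its column names are invented; `fit_transform` is the scikit-learn shortcut that runs `fit` and `transform` in one call.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a label column
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 65.0, 80.0],
    "label": [0, 1, 1],
})

# Scale only the features, never the label
features = toy.drop("label", axis=1)
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# Put the scaled values back into a DataFrame with the feature names
toy_features = pd.DataFrame(scaled, columns=features.columns)
print(toy_features)
print(toy_features.mean())       # approximately 0 per column
print(toy_features.std(ddof=0))  # approximately 1 per column
```

`StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed when verifying the result with pandas.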

### PCA 


In [None]:
pca = PCA(n_components=2)

In [None]:
pca.fit(scaled_features) # scaled data from the scaler transform
x_pca = pca.transform(scaled_features)
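
As a quick check of what the two components capture, the following sketch runs the same steps on the iris dataset that ships with scikit-learn. The dataset choice is only an example; `explained_variance_ratio_` is the fitted attribute that reports the share of variance per component.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# PCA is sensitive to feature scale, so standardize first
scaled_iris = StandardScaler().fit_transform(iris.data)

# Reduce the four features to two principal components
pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled_iris)

print(x_pca.shape)                    # rows stay the same, columns drop to 2
print(pca.explained_variance_ratio_)  # variance share of each component
```

If the two ratios sum to a value close to 1, a 2D scatter plot of `x_pca[:, 0]` against `x_pca[:, 1]` loses little information.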

## Display data

### Create a distribution plot (histogram)

In [None]:
sns.displot(data['col you want to view'])

### pandas hist plot

In [None]:
data['col'].hist()

### Count plot

In [None]:
sns.countplot(x='col1', hue='col2', data=data)

### LM plot split into side-by-side plots with `col`

In [None]:
sns.lmplot(x='col1', y='col2', hue='col3', col='col4', data=data)

### Create a joint plot
`x` is the first column and `y` is the other column of the dataframe you want to compare it with.  
The plot can be changed to a KDE plot using `kind`.

In [None]:
sns.jointplot(x='col1', y='col2', data=data)

In [None]:
sns.jointplot(x='col1', y='col2', data=data, kind='kde', color='red')

### Pair plot
Plots the columns against each other. Use `hue` to split by value, e.g. a column `sex` with the values `male` or `female`.

In [None]:
sns.pairplot(data, hue='col')

### KDE plot

In [None]:
sns.kdeplot(x=data['col1'], y=data['col2'], cmap='plasma', fill=True, thresh=0.05)  # fill replaces the deprecated shade

### Scatter plot

In [None]:
# '3' is a placeholder for the column that determines the point color
sns.scatterplot(x=data['1'], y=data['2'], hue=data['3'], palette='rainbow')

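### Create two plots side by side

In [None]:
# assumes `data` is a (features, labels) tuple, e.g. from sklearn.datasets.make_blobs,
# and `km` is a fitted KMeans model
f, (ax1, ax2) = plt.subplots(1,2, sharey=True, figsize=(10,6))
ax1.set_title('K_Means')
ax1.scatter(data[0][:,0], data[0][:,1], c=km.labels_)

ax2.set_title('Original')
ax2.scatter(data[0][:,0], data[0][:,1], c=data[1])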

### PCA diagram

## Machine learning

### Splitting the dataset into training and test
`X` holds the features that are used to predict `y`.  
`y` holds the outcomes we want to predict; it must not be part of the features.

```python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```

In [None]:
X = data.drop('Label', axis=1) # or select features explicitly: X = data[['col1', 'col2']]
y = data['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
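
A small sketch of the split on invented data shows the resulting sizes. The frame `toy` and its column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two features and a label
toy = pd.DataFrame({
    "col1": range(10),
    "col2": range(10, 20),
    "Label": [0, 1] * 5,
})

X = toy.drop("Label", axis=1)  # features only
y = toy["Label"]               # the outcome to predict

# 30% of the rows go to the test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print(len(X_train), len(X_test))
```

The rows are shuffled before splitting, but features and labels stay aligned through their shared index.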

### Use your model
#### LinearRegression,  LogisticRegression


In [None]:
model_linreg = LinearRegression()
model_logreg = LogisticRegression()


# pick one model so the following cells can use the generic name `model`
model = model_logreg

### Linear regression plot

In [None]:
plt.scatter(X,y)
# coef_feat is the slope (feature coefficient) and coef_const the intercept,
# e.g. model.coef_[0] and model.intercept_ of a fitted LinearRegression
yhat = coef_feat*X + coef_const
fig = plt.plot(X,yhat, lw=4, c='orange', label ='regression line')
plt.xlabel('Feature', fontsize = 20)
plt.ylabel('Label', fontsize = 20)
plt.show()

### Regression using statsmodels.api -> sm

In [None]:
X = sm.add_constant(X)
results = sm.OLS(y,X).fit()
results.summary()

#### KNeighborsClassifier
Use the elbow method to find the best `n_neighbors`.

In [None]:

# elbow method
error_rate = []

for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
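
The loop can be tried end to end on the iris dataset. This is a sketch: the dataset and the name `best_k` are illustrative assumptions. Picking the k with the lowest test error is a simple heuristic; choosing k on the test set makes the reported error optimistic, so cross-validation is the more careful option.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

k_values = range(1, 40)
error_rate = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # share of test samples that were misclassified for this k
    error_rate.append(np.mean(knn.predict(X_test) != y_test))

# Smallest k that reaches the lowest error rate
best_k = k_values[int(np.argmin(error_rate))]
print(best_k, min(error_rate))
```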

#### Display the KNN elbow plot

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K value')
plt.xlabel('K')
plt.ylabel('Error Rate')

#### DecisionTreeClassifier

In [None]:
model = DecisionTreeClassifier()

#### Create the KNN classifier with the best value from the elbow method

In [None]:
n = 5 # value from previous code
model_knn = KNeighborsClassifier(n_neighbors=n)

### Support Vector Machines

In [None]:
model = SVC()

### Grid search for SVC

In [None]:
param_grid = {'C': [0.1,1,10,100,1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel':['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True)
grid.fit(X_train,y_train)
print(grid.best_params_)
print(grid.best_estimator_)
grid_pred = grid.predict(X_test)

### Random Forest 

In [None]:
model = RandomForestClassifier(n_estimators=100)

### KMeans clustering

In [None]:
model = KMeans(n_clusters=4)

## Fit the data to the model
Use the training features `X_train` and the training labels `y_train`.

In [None]:
model.fit(X_train,y_train)

### KMeans
KMeans is unsupervised: fit it on the feature data only, without labels.

In [None]:
model.fit(data)
#show centers for kmeans
model.cluster_centers_
# show labels
model.labels_
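
A runnable sketch of the KMeans fit on synthetic blobs. `make_blobs` is a scikit-learn helper that generates clustered points; the parameters and the names `blobs` and `km` are chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with four groups; true_labels is not used for fitting
blobs, true_labels = make_blobs(
    n_samples=200, centers=4, n_features=2, random_state=101
)

# n_init is set explicitly because its default differs between sklearn versions
km = KMeans(n_clusters=4, n_init=10, random_state=101)
km.fit(blobs)

print(km.cluster_centers_)  # one row of coordinates per cluster
print(km.labels_[:10])      # cluster index assigned to the first ten points
```

The cluster indices are arbitrary, so they need not match the numbering in `true_labels` even when the grouping is the same.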

### Predict on the test data
Use the test features `X_test`.

In [None]:
predictions = model.predict(X_test)

### Linear regression data

In [None]:
# R-squared of the model on the training data
model.score(X_train,y_train)
# Intercept (constant term)
model.intercept_
# Coefficients, one per feature
model.coef_
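
To make these three attributes concrete, here is a sketch that fits a line to points generated from y = 2x + 1. The data and the name `linreg` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Points that lie exactly on the line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

linreg = LinearRegression().fit(X, y)

print(linreg.coef_)        # slope per feature, here close to 2
print(linreg.intercept_)   # constant term, here close to 1
print(linreg.score(X, y))  # R-squared, close to 1 for a perfect fit
```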

### Evaluate the result of the prediction
Pass the correct test labels `y_test` and the predicted values `predictions` to `classification_report()` to see how well the predictions match the true labels.

The confusion matrix can also be used to evaluate the predictions: it counts how often each true class was predicted as each class.

## Evaluate the results

### classification_report

In [None]:
print(classification_report(y_test, predictions))

### Confusion Matrix

In [None]:
print(confusion_matrix(y_test, predictions))
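
How to read the matrix can be sketched with five hand-written labels (invented values). In scikit-learn the rows are the true classes and the columns are the predicted classes.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions
y_test = [0, 0, 1, 1, 1]
predictions = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_test, predictions)
print(cm)
# Row 0: true class 0 -> one predicted as 0, one predicted as 1
# Row 1: true class 1 -> one predicted as 0, two predicted as 1

print(classification_report(y_test, predictions))
```

The diagonal holds the correct predictions, so here 3 of 5 samples are classified correctly.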

### Create tree

#### Decision tree visualization

Scikit-learn has a built-in ability to visualize decision trees.

You will probably not need this often, and it requires installing `pydot` and `graphviz`:

    conda install graphviz
    
    pip install pydot
    
    pip install six
    
*Note 1: Please mind the order. Depending on your Python and library versions, graphviz requires some package versions to be changed. We have to agree to these changes in order to use decision tree visualizations.*

*Note 2: On Ubuntu Linux, graphviz has to be installed with `sudo apt install graphviz -y`.*

Nevertheless, for the sake of completeness, let us look at an example!

In [None]:
from IPython.display import Image  
from six import StringIO  
from sklearn.tree import export_graphviz
import pydot 

features = list(data.columns[1:])
features
dot_data = StringIO()  
export_graphviz(model, out_file=dot_data,feature_names=features,filled=True,rounded=True)

graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph[0].create_png())  
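
If installing graphviz is not an option, scikit-learn can also print a tree as plain text with `sklearn.tree.export_text`. This is an alternative to the approach above, not a replacement for it. The sketch uses the iris dataset and a shallow tree purely as an example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the split rules; no graphviz or pydot required
tree_text = export_text(tree, feature_names=list(iris.feature_names))
print(tree_text)
```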