# Personal notebook for Data Science
> **PART 1: Introduction to Python**<br>
> **PART 2: Introduction to Functions**<br>
> **PART 3: Cleaning data**<br>
> Created by: Adem GENCER

# Personal notebook for Data Science
> **PART 1: Introduction to Python**<br>
> *Created by: Adem GENCER*

### Contents
1. Import libraries
2. Load and read database
3. Basic database functions
4. Correlation map
5. Matplotlib
6. Dictionary
7. Pandas
8. Loops

### 1. Import libraries 
* **numpy** for linear algebra
* **pandas** for dataframes
* **matplotlib** for plotting graphs
* **seaborn** for visualisation

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # import matplot library
import seaborn as sns # import seaborn library

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### 2. Load and read a database. 

In [None]:
df = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")

### 3. Basic database functions
* Show data **INFO**
* Loading **HEAD** and **TAIL** of data
* **COLUMNS** of data

In [None]:
df.info()

The "info()" method shows how the data is structured:
* How many entries
* How many columns
* Column types
* Memory usage

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.columns

In [None]:
df.shape

### 4. Correlation map

In [None]:
df.corr()

In [None]:
f, ax = plt.subplots(figsize=(10,10))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt=".1f", ax=ax)  # linewidths is the documented heatmap parameter

### 5. MATPLOTLIB
Used for data visualisation and for creating **LINE**, **SCATTER**, **BAR**, **HISTOGRAM** graphs.


In [None]:
df.chol.plot(kind="line", color="blue", label="Chol", linewidth=1, alpha=.5, grid=True, linestyle="-")
df.age.plot(kind="line", color="red", label="Age", linewidth=1, alpha=.7, grid=True, linestyle="-")
plt.legend(loc="upper right") # Legend position
plt.xlabel("Case ID")           # Label of X axis
plt.ylabel("Level")     # Label of Y axis
plt.title("Cholesterol")      # Title of graph
plt.show()                    # Display the figure and suppress the text output of the last call

In [None]:
df.loc[:,["age", "chol"]].plot(subplots=True)
plt.show()

In [None]:
df.plot(kind="scatter", x="age", y="chol", alpha=.7, color="green")
plt.xlabel("Age")
plt.ylabel("Chol")
plt.title("Cholesterol Age Scatter Graphic")
plt.show()

In [None]:
df.age.plot(kind="hist", bins=20, range=(0,100), density=True)   # normed is deprecated; use density instead
plt.show()

In [None]:
df.age.plot(kind="hist", cumulative=True)
plt.show()

### 6. DICTIONARY


A dictionary stores KEY-VALUE pairs.

In [None]:
myDic = { "name": "Adem", "surname": "Gencer", "age": 38}  # Define a dictionary
print(myDic)
print(myDic.keys()) # Get dictionary keys
print(myDic.values()) # Get dictionary values

In [None]:
myDic = { "name": "Adem", "surname": "Gencer", "age": 38}  # Define a dictionary
myDic["name"] = "John"        # Edit existing value
myDic["Location"] = "Turkey"  # Add a new KEY - VALUE pair
print(myDic)
del myDic["name"]             # Delete a KEY - VALUE pair
print(myDic)
myDic.clear()                 # Clear a dictionary
print(myDic)
#del myDic                    # Delete a dictionary
#print(myDic)

In [None]:
myDic = { "name": "Adem", "surname": "Gencer", "age": 38}  # Define a dictionary
print("age" in myDic)      # Search a KEY in a dictionary
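
The cells above cover keys, values, editing, and membership tests. As a minimal sketch of two related idioms, the block below loops over key-value pairs with `items()` and looks up a possibly missing key with `get()`. The `person` dictionary and the `"email"` key are made-up illustration values, not part of the notebook's data.

```python
# Hypothetical example dictionary (same shape as myDic above)
person = {"name": "Adem", "surname": "Gencer", "age": 38}

# items() yields (key, value) pairs in insertion order
pairs = [(key, value) for key, value in person.items()]
print(pairs)

# get() returns a default instead of raising KeyError when the key is missing
email = person.get("email", "unknown")
print(email)
```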

### 7. PANDAS
Pandas is a library that lets you work with your data in a structured, table-like format (the dataframe).

In [None]:
print(df["age"])     # series
print("--------------------------------")
print(df[["age"]])   # dataframe

### Indexing

In [None]:
# df.age.head()
# df["age"][1]
# df["age"].head()
# df.age[1]             
# df[["age", "sex"]]       # Selecting columns

In [None]:
dfEx = df.copy()
dfEx = dfEx.set_index("age")          # Change index.
dfEx.head()

In [None]:
dfEx.index.name = "myIndex"
dfEx.head()

In [None]:
dfEx.index = range(1,1515,5)
dfEx.head()

### Outer - inner index

In [None]:
dfEx.set_index(["sex", "cp"]).head()

In [None]:
dfEx.unstack(level=0).head()


### GroupBy

In [None]:
df.groupby("sex").mean()

In [None]:
df.groupby("sex")[["age", "chol"]].min()

### Pivot table

In [None]:
df.pivot( columns="cp", values="chol").head(10)

### Selecting

In [None]:
df.loc[10:0:-1, "cp":"fbs"]  # Reverse select

### Filtering data in pandas
* Filtering data produces a TRUE-FALSE (boolean) series.
* Applying this filter to the data returns only the matching rows.
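
The two steps can be sketched on a tiny, made-up dataframe (the name `toy` and its values are assumptions for illustration; the notebook itself works on the heart-disease dataframe `df`):

```python
import pandas as pd

# Made-up data, not taken from the heart-disease dataset
toy = pd.DataFrame({"age": [45, 72, 63, 80], "chol": [210, 310, 250, 330]})

mask = toy["age"] > 70          # boolean Series, one True/False per row
print(mask.tolist())

older = toy[mask]               # keeps only the rows where the mask is True
print(older)

# Conditions combine with & (and), | (or), ~ (not); each one needs parentheses
both = toy[(toy["age"] > 70) & (toy["chol"] > 320)]
print(both)
```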

In [None]:
xFilter = df["age"] > 70 
# This creates a boolean filter (mask) based on the data values.

In [None]:
df[xFilter]
# Applying the filter to the dataframe shows only the rows where the filter is TRUE.

Create AGE and CHOL filters.

In [None]:
xFilterAge = df["age"] > 60
xFilterChol = df["chol"] > 300

In [None]:
df[xFilterAge].head() # Rows that pass the AGE filter

In [None]:
df[xFilterChol].head() # Rows that pass the CHOL filter

In [None]:
df[xFilterAge & xFilterChol]  # Apply the AGE and CHOL filters together with the AND operator.

# The same can be written in one line:
# df[(df["age"] > 60) & (df["chol"] > 300)]

In [None]:
df.age[xFilterChol].head()    # Filter with xFilterChol but show age column.

### Transforming

In [None]:
def ageM(n):
    return n*12
df.age.apply(ageM).head()

# df.age.apply(lambda n: n*12)          # The same with a lambda, in a shorter way...

# apply() returns the age column converted from years to months (age*12); df itself is not changed.

In [None]:
df["agelabel"] = df.age * df.sex        # Experimental column only; the product has no real meaning.
df.head()

In [None]:
df["stayofhospital"] = 0              # Create a new column
df.head()

### TIMESERIES

In [None]:


dfExt = df.head().copy()
dateList = ["2020-01-10","2020-01-12","2020-01-12","2020-01-14","2020-01-15"]
dateTime = pd.to_datetime(dateList)
dfExt["date"] = dateTime
dfExt = dfExt.set_index("date")  # Change index to timeSeries data.
dfExt.head()

In [None]:
dfExt.loc["2020-01-12":"2020-01-14"]   # Use TimeSeries for index.

### Resampling with TimeSeries
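> With a time-series index the data can be resampled to another frequency, for example:
* D: Day
* M: Month (pandas 2.2 and later prefer ME)
* A: Year (pandas 2.2 and later prefer YE)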
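
A minimal, self-contained sketch of daily resampling, assuming a made-up `toy` dataframe with readings on non-consecutive days (the values are invented for illustration). Days without readings appear as NaN after resampling and can then be filled by interpolation:

```python
import pandas as pd

# Made-up readings on three distinct, non-consecutive days
dates = pd.to_datetime(["2020-01-10", "2020-01-12", "2020-01-12", "2020-01-14"])
toy = pd.DataFrame({"chol": [200.0, 240.0, 260.0, 300.0]}, index=dates)

daily = toy.resample("D").mean()        # one row per calendar day
print(daily)                            # days without readings hold NaN

filled = daily.interpolate("linear")    # fill each gap on a straight line between its neighbours
print(filled)
```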

In [None]:
dfExt.resample("D").mean()

In [None]:
dfExt.resample("D").mean().interpolate("linear")   # Interpolate missing data with a linear function.

### 8. LOOPS

There are two different loop structures in Python: **FOR** loops and **WHILE** loops.
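
The cells in this section only use FOR loops, so here is a small sketch that puts both structures side by side. The `ages` list is made up for illustration.

```python
ages = [63, 37, 41, 56, 57]   # made-up values

# FOR: visit every element of a sequence
total = 0
for age in ages:
    total += age
print("sum:", total)

# WHILE: repeat as long as a condition stays true
i = 0
while i < len(ages) and ages[i] > 40:
    i += 1
print("first index with age <= 40:", i)
```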


In [None]:
for index, value in enumerate(df["age"][0:10]):
    print("index:", index, " value:", value)


In [None]:
for index, value in df[["age"]][10:11].iterrows():
    print("index:",index," value:",value)

# Personal notebook for Data Science
> **PART 2: Introduction to Functions**<br>
> *Created by: Adem GENCER*

### Contents
1. Functions basics
2. Function variables
3. Lambda function
4. Iterable
5. Zip
6. List

### 1. Functions basics

In [None]:
x = 5 # This is a global variable
def myFunc():
    """ This is the docstring of the function. """
    y = 2               # This is a local variable. It cannot be reached from outside the function.
    result = x*y        # This computes x*y.
                        # Since there is no local x inside the function, Python looks it up globally.
    print("Function returned! Result = ", result)
myFunc()

Functions can also be NESTED: a function can be defined inside another function.
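
A minimal sketch of a nested function. The names `outer` and `square` are hypothetical and chosen only for this illustration:

```python
def outer(x):
    """Return x squared plus one, using a helper defined inside."""
    def square(n):          # only visible inside outer()
        return n * n
    return square(x) + 1

print(outer(3))
```

The inner function `square` cannot be called from outside `outer`, in the same way that a local variable cannot be reached from outside its function.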

### 2. Function variables
* Default variables
* List of variables (*args)
* Dictionary of variables (**kwargs)

In [None]:
def myFunc(x, y = 3, z = 5):
    """ You must pass x when you call this function. 
        If you don't pass y or z, their default values are used... """
    print(x+y+z)

myFunc(5)
myFunc(2,1)

In [None]:
def myFunc(*args):
    """ Variable-length positional arguments. You can pass any number of parameters. """
    for each in args:
        print(each)
    # print("--",args)  # args is a tuple holding all the arguments.
myFunc("first")
print("--------")
myFunc("First", "Second", 5)

In [None]:
def myFunc(**kwargs):
    """ Keyword arguments are collected into a dictionary. """
    for key, value in kwargs.items():
        print("key:", key, " value:", value)
    
myFunc(name = "Adem", age = 38)
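
The three kinds of parameters can also appear in one signature. The sketch below is illustrative only; the function `report` and its parameter names are hypothetical:

```python
def report(title, *values, sep=", ", **labels):
    """title is required, values is a tuple, sep has a default, labels is a dict."""
    parts = [str(v) for v in values]
    parts += [f"{key}={val}" for key, val in labels.items()]
    return title + ": " + sep.join(parts)

line = report("patient", 63, 233, sex=1)
print(line)
```

Because `sep` comes after `*values`, it can only be passed by keyword, for example `report("patient", 63, sep=" | ")`.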

### 3. Lambda function
A lambda function is a one-line (single-expression) function.

In [None]:
multiply = lambda x, y: x*y  # One-line function. It returns the value of its expression.
print(multiply(5,4))
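
Lambdas are most often passed directly to another function instead of being given a name. A short sketch with made-up inputs, using the built-in `sorted` and `map`:

```python
names = ["chol", "age", "trestbps"]

# Sort by string length; the lambda is the sort key
by_length = sorted(names, key=lambda s: len(s))
print(by_length)

# Convert years to months for each element
months = list(map(lambda years: years * 12, [1, 2, 3]))
print(months)
```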

### 4. Iterable
Iterate over a string, list or dictionary with the iter() function.

In [None]:
mystr = "TestString"
print(next(iter(mystr)))
print(*iter(mystr))
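
A short sketch of what happens when an iterator runs out, and of what iterating a dictionary yields. The list and dictionary here are made-up examples:

```python
it = iter([10, 20])
first = next(it)        # 10
second = next(it)       # 20

# A third call has nothing left to return and raises StopIteration
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True
print(first, second, exhausted)

# Iterating a dictionary yields its keys
keys = list(iter({"name": "Adem", "age": 38}))
print(keys)
```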

### 5. Zip
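Combine two lists element by element. The first holds the **keys** and the second the **values**.

In [None]:
listKey = ["name", "surname", "age"]   # Lists keep their order; sets ({...}) would pair the items arbitrarily.
listVal = ["Adem", "Gencer", 38]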

zipList = zip(listKey, listVal)
myList = list(zipList)
print(myList)
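
Since one sequence is described as keys and the other as values, the zipped pairs can be turned straight into a dictionary. A minimal sketch, assuming ordered lists (the variable names below are illustrative):

```python
keys = ["name", "surname", "age"]
values = ["Adem", "Gencer", 38]

person = dict(zip(keys, values))    # pairs become key-value entries
print(person)

# zip stops at the shorter input
short = list(zip([1, 2, 3], ["a", "b"]))
print(short)
```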


In [None]:
unZip = zip(*myList)
unList1, unList2 = list(unZip)
print(unList1)        # This is a tuple
print(list(unList2))  # This is a list

### 6. List

In [None]:
myList = [1,2,3,4,5]
revList = [i + 1 for i in myList]  # Apply an operation (here +1) to every element.
print(revList)

In [None]:
myList = [1,2,3,4,5]
revList = [0 if i % 2 == 0 else 1 for i in myList]  # Get 0 for even numbers.

print(myList)
print(revList)

### Example for dataframe
* Get the mean of the AGE values. This is our cutoff value.
* Define a column (AGELEVEL) with a list comprehension over the data. If a value is above the cutoff, label it HIGH AGE, else LOW AGE.
* Show the rows labelled 0 to 10 (`loc` includes the end label) with the AGE and AGELEVEL columns.

In [None]:
cutoff = sum(df.age)/len(df.age)
print("Cutoff value: ",cutoff)

df["ageLevel"] = ["High age" if i > cutoff else "Low age" for i in df.age]
df.loc[:10,["age","ageLevel"]]

# Personal notebook for Data Science
> **PART 3: Cleaning data**<br>
> *Created by: Adem GENCER*

### Contents
1. Exploratory data analysis (EDA)
2. Melt and pivot data
3. Concatenating data
4. Datatypes
5. Missing data

### 1. Exploratory data analysis (EDA)
* **Lower quartile**: 25%, Q1
* **Median**: 50%, Q2
* **Upper quartile**: 75%, Q3

**OUTLIER**: a value more than 1.5 × IQR below Q1 or above Q3, where IQR = Q3 - Q1
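
The rule can be sketched with the standard library alone. The `data` list is made up, and `method="inclusive"` (linear interpolation between data points) is one of several quartile conventions, chosen here as an assumption:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up values; 100 is the odd one out

# Quartiles with linear interpolation between data points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this fence count as outliers
upper = q3 + 1.5 * iqr   # values above this fence count as outliers
outliers = [x for x in data if x < lower or x > upper]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower, upper)
print("outliers:", outliers)
```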

In [None]:
df.describe()

In [None]:
df.boxplot(column = "chol", by = "sex")
plt.show()

### 2. Melt and pivot data

In [None]:
df_short = df.head()
meltedData = pd.melt(frame=df_short, id_vars = "age", value_vars = ["sex", "chol"])
meltedData

In [None]:
meltedData.pivot(index = "age", columns = "variable", values = "value")

### 3. Concatenating data

Combining dataframes.
* **axis**: 0 for vertical, 1 for horizontal
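
The effect of `axis` is easiest to see in the resulting shapes. A small sketch with two made-up dataframes (`top` and `bottom` are hypothetical names):

```python
import pandas as pd

# Made-up frames with the same columns
top = pd.DataFrame({"age": [63, 37], "chol": [233, 250]})
bottom = pd.DataFrame({"age": [57, 56], "chol": [236, 354]})

stacked = pd.concat([top, bottom], axis=0, ignore_index=True)   # rows added below
side = pd.concat([top, bottom], axis=1)                          # columns added to the right

print(stacked.shape)
print(side.shape)
```

With `axis=1` the column names repeat, so this form is mostly useful when the frames hold different columns for the same rows.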

In [None]:
headerData = df.head()
tailData = df.tail()
pd.concat([headerData, tailData], axis=0, ignore_index= True)

### 4. Datatypes

In [None]:
df.dtypes

Data types can be converted with the **astype()** method.

In [None]:
df["ca"] = df["ca"].astype(float)   # Bracket assignment is the safer way to overwrite a column
df.dtypes

### 5. Missing data

* Drop missing data with **dropna()**
* Fill missing value with **fillna()**
* Fill missing value with sample data 
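
A minimal sketch of the first two options on a made-up dataframe with one missing value. Filling with the column mean is only one possible choice of replacement value, used here as an assumption:

```python
import pandas as pd

# Made-up data with one missing chol value
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233.0, None, 204.0]})

dropped = toy.dropna()                               # removes the row with the missing value
filled = toy.fillna({"chol": toy["chol"].mean()})    # replaces NaN in chol with the column mean

print(dropped)
print(filled)
```

Both calls return new dataframes; `toy` itself still contains the missing value.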

In [None]:
print(df.sex.value_counts(dropna=False))   # Count each value; dropna=False also counts missing values
df = df.dropna(subset=["sex"])             # Drop the rows where sex is missing
assert df.sex.notnull().all()              # Check the previous line: no missing values remain

In [None]:
assert df.columns[0] == "age"    # If the condition is true, assert returns nothing.