In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Diagnose Data for Cleaning**

We need to diagnose and clean data before exploring it.

Signs of unclean data:
* Inconsistent column names, such as mixed upper and lower case or spaces between words
* Missing data
* Text in different languages

Let's see how the head, tail, columns, shape and info methods help diagnose data.
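Inconsistent column names, the first problem listed above, can usually be normalized in one pass. This is a minimal sketch on a made-up frame; the column names are hypothetical and do not come from heart.csv.

```python
import pandas as pd

# Hypothetical frame with inconsistent column names (not from heart.csv)
raw = pd.DataFrame({" Age": [63, 37], "Resting BP": [145, 130], "CHOL ": [233, 250]})

# Strip surrounding whitespace, lower-case, and replace inner spaces with underscores
clean = raw.copy()
clean.columns = clean.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(clean.columns))
```

Working on a copy keeps the raw frame available for comparison while diagnosing.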

In [1]:
data = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')
data.head()

In [1]:
data.tail()

In [1]:
data.columns

In [1]:
data.shape

In [1]:
data.info()

**EXPLORATORY DATA ANALYSIS**

value_counts(): counts the frequency of each value

Outlier: a value that is considerably higher or lower than the rest of the data

Let Q3 be the value at 75% and Q1 the value at 25%. The interquartile range is IQR = Q3 - Q1.

Outliers are values smaller than Q1 - 1.5 * IQR or bigger than Q3 + 1.5 * IQR.

We will use the describe() method. Its output includes:

* count: number of entries
* mean: average of entries
* std: standard deviation
* min: minimum entry
* 25%: first quartile
* 50%: median or second quartile
* 75%: third quartile
* max: maximum entry
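The outlier rule above can be sketched with quantile(), which returns the same 25% and 75% values that describe() reports. The numbers below are hypothetical cholesterol-like values chosen for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical values; 564 is deliberately extreme
chol = pd.Series([211, 230, 240, 250, 260, 274, 564])

q1 = chol.quantile(0.25)   # first quartile (the "25%" row of describe())
q3 = chol.quantile(0.75)   # third quartile (the "75%" row of describe())
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the two fences
outliers = chol[(chol < lower_fence) | (chol > upper_fence)]
print(outliers.tolist())
```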

In [1]:
print(data['age'].value_counts(dropna=False))  # dropna=False also counts NaN values, if there are any

In [1]:
print(data['sex'].value_counts())

In [1]:
data.describe()  # null entries are ignored
# For example, the chol value 564 is an outlier.
# Q3 = 274.5, Q1 = 211, upper outlier limit = Q3 + 1.5 * (Q3 - Q1) = 369.75

**VISUAL EXPLORATORY DATA ANALYSIS**

Box plots: visualize basic statistics such as outliers, min/max and quartiles

In [1]:
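# The box spans Q1 (25%) to Q3 (75%); the line inside the box is the median (50%)
# The whiskers extend to the farthest points within 1.5 * IQR of the box,
# so they mark the min and max only when a group has no outliers
# Points beyond the whiskers are drawn individually as outliers
data.boxplot(column='chol', by='sex')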

**Tidy Data**

We tidy data with melt(), which reshapes wide data into long format: one row per (id, variable) pair.

In [1]:
data_new = data.head()    # take only the first 5 rows
data_new

In [1]:
melted = pd.melt(frame=data_new, id_vars='age', value_vars=['chol', 'thalach'])
melted
# Melting is a bridge between pandas and seaborn
# 'variable' and 'value' are the default column names created by melt()

**Pivoting Data**

Pivoting is the reverse of melting.

In [1]:
# index: age becomes the row index
# columns: each distinct entry of 'variable' becomes a column
# values: the entries of 'value' fill those columns
melted.pivot(index = 'age', columns = 'variable',values='value')
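One caveat worth sketching: pivot() requires every (index, columns) pair to be unique, so it fails as soon as an age repeats. The frame below is hypothetical and built so that age 57 appears twice. pivot_table() handles this by aggregating the duplicates; the mean is used here only for illustration.

```python
import pandas as pd

# Hypothetical long-format frame in which age 57 appears twice per variable
long_df = pd.DataFrame({
    "age": [57, 57, 63, 57, 57, 63],
    "variable": ["chol", "chol", "chol", "thalach", "thalach", "thalach"],
    "value": [236, 354, 233, 174, 163, 150],
})

# pivot() raises ValueError when (index, columns) pairs are duplicated
try:
    long_df.pivot(index="age", columns="variable", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead of failing
wide = long_df.pivot_table(index="age", columns="variable", values="value", aggfunc="mean")
print(wide)
```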

**Concatenating Data**

We can concatenate two dataframes.

In [1]:
data1 = data.head(3)
data2 = data.tail(3)
conc_data_row = pd.concat([data1, data2], axis=0, ignore_index=True)  # axis=0: stack the dataframes row-wise (vertically)
conc_data_row

In [1]:
data1 = data['age'].head()
data2 = data['trestbps'].head()
data3 = data['thalach'].head()
conc_data_col = pd.concat([data1, data2, data3], axis=1)  # axis=1: place the Series side by side as columns
conc_data_col

**Data Types**

There are 5 basic data types: object (string), boolean, integer, float and categorical.

We can convert between data types, for example from str to categorical or from int to float.

Why the category type is important:

* It makes the dataframe smaller in memory
* It can be utilized for analysis, especially with sklearn
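The memory point can be checked with memory_usage(deep=True). This is a rough sketch on a hypothetical column of repeated labels; the exact byte counts depend on the pandas version, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column with two string labels repeated many times
labels = pd.Series(["male", "female"] * 5000)
as_category = labels.astype("category")

# deep=True includes the memory held by the strings themselves
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest when a column has few distinct values.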

In [1]:
data.dtypes

In [1]:
data['age'] = data['age'].astype('float')
data['sex']  = data['sex'].astype('float')
# The purpose is to make the data cleaner and more readable.
data.dtypes

**Missing Data and Testing with Assert**

If we encounter missing data, we can:

1. Leave it as is
2. Drop it with dropna()
3. Fill missing values with fillna()
4. Fill missing values with a summary statistic such as the mean
5. Use an assert statement to check the result: a check that you can turn on while testing the program and turn off when you are done
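Options 2 to 5 can be sketched side by side on a small frame. The column names and values here are hypothetical and not taken from the turnout dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame({"state": ["A", "B", "C"], "turnout": [0.6, np.nan, 0.7]})

# Option 2: drop the rows where 'turnout' is NaN
dropped = df.dropna(subset=["turnout"])

# Option 3: fill with a constant
filled_const = df["turnout"].fillna(0.0)

# Option 4: fill with the column mean (mean() skips NaN by default)
filled_mean = df["turnout"].fillna(df["turnout"].mean())

# Option 5: assert raises AssertionError if any NaN is left, and is silent otherwise
assert dropped["turnout"].notnull().all()
assert filled_mean.notnull().all()
```

Each call returns a new object and leaves df unchanged, which makes it easy to compare the strategies before committing to one.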

In [1]:
data.info()

In [1]:
new_data = pd.read_csv('/kaggle/input/2020-us-general-election-turnout-rates/2020 November General Election - Turnout Rates.csv')
new_data.head()

In [1]:
new_data.shape

In [1]:
new_data.info()

In [1]:
new_data['Vote for Highest Office (President)'] = new_data['Vote for Highest Office (President)'].astype('category')

In [1]:
new_data["Vote for Highest Office (President)"].value_counts(dropna =False)
# as you can see, there are 28 NaN values

We can drop NaN values easily.

In [1]:
n_data1 = new_data.copy()   # work on a copy, because new_data is used below to fill missing values
n_data1 = n_data1.dropna(subset=["Vote for Highest Office (President)"])   # drop the rows where this column is NaN

In [1]:
n_data1["Vote for Highest Office (President)"].value_counts()

In [1]:
assert n_data1["Vote for Highest Office (President)"].notnull().all()   # passes silently: no NaN values remain

In [1]:
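new_data['Vote for Highest Office (President)'] = new_data['Vote for Highest Office (President)'].cat.add_categories('Unknown')
# Assign the result back instead of calling fillna(inplace=True) on a selected column
new_data['Vote for Highest Office (President)'] = new_data['Vote for Highest Office (President)'].fillna('Unknown')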

In [1]:
assert new_data['Vote for Highest Office (President)'].notnull().all()  # passes silently because no NaN values remain