# INTRODUCTION TO STATISTICS PART 1
<font color = '#5E1742'>
In this notebook, we will examine the basic concepts of statistics using the heart disease data set and apply them in Python.

![del3-1024x682.jpg](attachment:del3-1024x682.jpg)

<font color = '#cdcd00'>
Content: 
    
1. [Variable and Variable Types](#1)
    *            [1.1 According to Their Structure](#2)
    *            [1.2 According to their Property](#3)
    *            [1.3 According to Role in Scientific Researches](#4)
   
1. [The concept of zero in measurement](#5)
     *            [2.1 Absolute Zero (Natural)](#6)
     *            [2.2 Relative Zero](#7)
1. [Levels of Measurement:](#8) 
    *            [3.1 Interval](#9)
    *            [3.2 Ratio](#10)
    *            [3.3 Nominal](#11)
    *            [3.4 Ordinal](#12)
1. [Measures of Central Tendency](#13)
    *            [4.1 Mean](#14)
    *            [4.2 Median](#15)
    *            [4.3 Mode](#16)
1. [Measures of Dispersion](#17)
    *            [5.1 Range](#18)
    *            [5.2 Variance](#19)
    *            [5.3 Standard Deviation](#20)
    *            [5.4 Skewness](#21)
    *            [5.5 Kurtosis](#22)
    *            [5.6 Quartile](#23)
1. [Statistical Thinking Model: Mooney](#24) 
    *           [6.1 Definition of Data](#25)
    *           [6.2 Organizing Data](#26)
    *           [6.3 Representation of Data](#27)
    *           [6.4 Analyzing and Interpreting Data](#28)
1. [Population and Sample](#29) 
1. [Confidence Interval](#30) 
 

<a id = "1"></a><br>
# 1. Variable and Variable Types:
### A variable is a quantity that takes different values from unit to unit.

<a id = "2"></a><br>
## 1.1 According to their structure:

**1.1.1 Numerical variables:** Mathematically expressed by numbers. *Price, size*

**1.1.2 Categorical Variables:** Variables that cannot be mathematically expressed with numbers. *Gender* is a categorical variable. Male and female are the classes of this variable.

In [None]:
import pandas as pd
import statistics
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
data=pd.read_csv('../input/heart-disease-uci/heart.csv')
numerical_variables=data[['age','trestbps','chol','thalach','oldpeak']]
categorical_variables=data.drop(['age','trestbps','chol','thalach','oldpeak'],axis=1)
print("Numerical: ")
print(numerical_variables.head())
print("Categorical: ")
print(categorical_variables.head())

<a id = "3"></a><br>
## 1.2 According to their property

**1.2.1 Continuous Variable:** A variable that can take infinitely many values between any two values. *Age, height* (for example, infinitely many values can be written between 180 cm and 181 cm). It is measured rather than counted piece by piece.


**1.2.2 Discontinuous (Discrete) Variable:** A variable that can take only a limited number of values between any two values. It is expressed in whole numbers, without decimal fractions. For example, *number of goals, number of questions solved, blood groups.*

<a id = "4"></a><br>
## 1.3 According to role in scientific researches:

**1.3.1 Independent Variable:** The variable whose effect we are curious about; it is the one that affects or causes. For example, does smoking affect heart health? *Smoking = Independent Variable*

**1.3.2 Dependent Variable:** The variable affected by the independent variable. In the previous example, *Heart health = Dependent Variable*

<a id = "5"></a><br>
# 2. The concept of zero in measurement:
It refers to the starting point of measurements.
<a id = "6"></a><br>
**2.1 Absolute Zero (Natural)**: It means nothingness. It cannot take a negative value. For example, *if the number of solved questions is equal to zero, it means you did not solve any questions.*
<a id = "7"></a><br>
**2.2 Relative Zero:** It can take a negative value. It does not mean nothingness. For example, *degrees Celsius.*

<a id = "8"></a><br>
# 3. Levels of Measurement:



<a id = "9"></a><br>
* **3.1 Interval:** On this scale, the distances between measurement values are equal and meaningful, so differences between numbers can be interpreted. For example, *temperature, exam scores.*
 
 <a id = "10"></a><br>
* **3.2 Ratio:** The ratio scale has all the features of the interval scale, and its starting point is an absolute zero. For example, someone who is zero years old does not actually exist, and if the number of students in the class is zero, it means that nobody is in the class.

 <a id = "11"></a><br>
* **3.3 Nominal:** Indicates whether objects are similar in their properties or not. Mathematical operations cannot be performed between classes, but frequency and mode can be calculated. There is no hierarchy between classes. For example, *Gender, Marital Status*
 
 <a id = "12"></a><br>
* **3.4 Ordinal:** Classes can be ordered according to a criterion, so there is a hierarchy between them. For example, *military ranks (Captain, Major)*
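
The difference between these scales can be sketched in a few lines. This is only an illustration: the rank list and the temperatures are made up, and using an ordered `pd.Categorical` to model an ordinal variable is one possible choice, not something the data set requires.

```python
import pandas as pd

# Hypothetical ranks, ordered from lowest to highest (illustration only)
ranks = pd.Categorical(
    ["Major", "Captain", "Captain", "Major"],
    categories=["Captain", "Major"],
    ordered=True,
)
# Ordinal data can be compared and ranked, but not added or averaged
highest_rank = ranks.max()

# Interval scale: 0 degrees Celsius is a relative zero, so this ratio has no meaning
celsius_ratio = 20 / 10
# Kelvin starts at an absolute zero, so the same two temperatures give a meaningful ratio
kelvin_ratio = (20 + 273.15) / (10 + 273.15)

print(highest_rank)
print(round(celsius_ratio, 3), round(kelvin_ratio, 3))
```

The two ratios differ, which is why "twice as hot" is not a valid statement on an interval scale.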

<a id = "13"></a><br>
# 4. Measures of Central Tendency:
Measures of central tendency are numbers that indicate the centre of a set of ordered numerical data.
1. Mean
1. Median
1. Mode


<a id = "14"></a><br>
### 4.1 Mean
The mean is calculated by adding up all of the values and dividing by the number of values. 
Let's find the mean of numerical variables.

In [None]:
print(numerical_variables.mean(axis=0))

<a id = "15"></a><br>
### 4.2 Median
The median is the "middle" value of a set of numbers sorted in ascending or descending order. Let's find the median of numerical variables.

In [None]:
print(numerical_variables.median(axis=0))

## Note: If you have a lot of outliers, you should use the median instead of the mean.
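
A small sketch of why the note above holds. The values are hypothetical, cholesterol-like numbers and are not taken from the data set; only the standard `statistics` module is used.

```python
import statistics

# Hypothetical values (not from heart.csv)
values = [200, 210, 220, 230, 240]
with_outlier = values + [600]  # one extreme value added

mean_before, median_before = statistics.mean(values), statistics.median(values)
mean_after, median_after = statistics.mean(with_outlier), statistics.median(with_outlier)

# The mean jumps, the median barely moves
print(mean_before, median_before)
print(round(mean_after, 2), median_after)
```

A single extreme value pulls the mean far more than the median.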

<a id = "16"></a><br>
### 4.3 Mode 
The mode is the most frequently occurring number.
* The mode is not a reliable measure when the population or sample is small.
* The mode is not affected by outliers.
* It can be calculated in numerical and categorical variables.

In [None]:
print(data.mode(axis=0))
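
As a quick illustration of the points above, here is a sketch with made-up lists (not columns of the data set). It shows the mode of a categorical variable and, with `statistics.multimode`, what happens when several values tie.

```python
import statistics

# Hypothetical blood groups (categorical) and goal counts (numerical)
blood_groups = ["A", "0", "A", "B", "AB", "A"]
goals = [1, 2, 2, 3, 3]

most_common_group = statistics.mode(blood_groups)  # most frequent class
tied_modes = statistics.multimode(goals)           # every value tied for most frequent

print(most_common_group)
print(tied_modes)
```

With only five observations, two values tie for the mode, which is one reason the mode is unreliable for small samples.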

In [None]:
#for all variables
data.describe()

<a id = "17"></a><br>
# 5. Measures of Dispersion
Measures of dispersion show how “spread out” the elements of a data set are around the mean.
1. Range
1. Variance
1. Standard deviation
1. Skewness
1. Kurtosis
1. Quartile

<a id = "18"></a><br>
### Range:
The range of a data set is the difference between the largest value and the smallest value. Think of an exam: a large range of exam results means that the exam distinguishes strongly between students.

In [None]:
import numpy as np

def minmax(val_list):
    # Column-wise minimum, maximum, and range (maximum - minimum)
    min_val = val_list.min()
    max_val = val_list.max()
    rangevalue = max_val - min_val
    print("Maximum value:\n{0}\nMinimum value:\n{1}\nRange value:\n{2}".format(
        max_val, min_val, rangevalue))

minmax(numerical_variables)

<a id = "19"></a><br>
### Variance:
You can think of the variance as the average squared difference between the elements of a data set and the mean. 

In [None]:
print(statistics.variance(data['age']))
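
Note that `statistics.variance` computes the *sample* variance, which divides by n - 1 rather than n. The sketch below spells out both versions on a short made-up list so the difference is visible; the numbers are hypothetical.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
mean = statistics.mean(values)
squared_diffs = [(x - mean) ** 2 for x in values]

population_variance = sum(squared_diffs) / len(values)    # divide by n
sample_variance = sum(squared_diffs) / (len(values) - 1)  # divide by n - 1

# The hand-computed values agree with the library functions
print(population_variance, statistics.pvariance(values))
print(sample_variance, statistics.variance(values))
```

For a column as long as `age` the two results are nearly identical, but for small samples the gap matters.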

<a id = "20"></a><br>
### Standard deviation:
The standard deviation is simply the square root of the variance. If the standard deviation is high, we can say that the distribution is heterogeneous.

In [None]:
age_sdev=statistics.stdev(data['age'])
print('The Standard deviation of age: {:.4f}'.format(age_sdev))

<a id = "21"></a><br>
### Skewness: 
Skewness describes how far the distribution of a variable departs from symmetry. If the coefficient of skewness is 0, the distribution is symmetric, as a normal distribution is. The picture below illustrates this.

![Relationship_between_mean_and_median_under_different_skewness.png](attachment:Relationship_between_mean_and_median_under_different_skewness.png)

### Let's look at the coefficient of skewness

In [None]:
data.skew(axis = 0, skipna = True)
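
To get a feeling for the sign of the coefficient, here is a sketch on two tiny made-up series (not from the data set). It also checks the relationship between mean and median in a right-skewed distribution.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])         # mirror image around 3
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])  # long tail to the right

symmetric_skew = symmetric.skew()
right_skew = right_skewed.skew()

print(symmetric_skew)
print(right_skew)
# In a right-skewed distribution the mean is pulled above the median
print(right_skewed.mean() > right_skewed.median())
```

A positive coefficient means a longer right tail, and a negative one a longer left tail.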

### And graph

In [None]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
# Histograms with a density curve; histplot replaces the deprecated distplot
f, axes = plt.subplots(2, 2, figsize=(15, 10), sharex=False)
sns.histplot(data.age.values, kde=True, color="skyblue", ax=axes[0,0])
sns.histplot(data.iloc[:,4], kde=True, color="olive", ax=axes[0,1])
sns.histplot(data.iloc[:,7], kde=True, color="gold", ax=axes[1,0])
sns.histplot(data.iloc[:,3], kde=True, color="teal", ax=axes[1,1])

#thanks to MMelnicki
#https://stackoverflow.com/a/54775278

<a id = "22"></a><br>
### Kurtosis: 
Kurtosis describes how sharp the peak and how heavy the tails of a distribution are. pandas reports excess kurtosis, so a normal distribution has a coefficient of 0.

In [None]:
data.kurtosis(axis = 0, skipna = True)
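
The sign of the coefficient can be illustrated with two made-up series. This is a sketch only; the values are hypothetical and chosen to make the contrast obvious.

```python
import pandas as pd

flat = pd.Series([1, 2, 3, 4, 5])                         # evenly spread, no tails
heavy_tailed = pd.Series([-10, 0, 0, 0, 0, 0, 0, 0, 0, 10])  # sharp peak, extreme tails

flat_kurt = flat.kurt()
heavy_kurt = heavy_tailed.kurt()

print(flat_kurt)   # negative: flatter than a normal distribution
print(heavy_kurt)  # positive: heavier tails than a normal distribution
```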

<a id = "23"></a><br>
### Quartile:
Quartiles measure the spread of values above and below the median by dividing the distribution into four groups.
Quartiles divide the data at three points (**a lower quartile, the median, and an upper quartile**) to form four groups of the dataset.

![median-quartiles.png](attachment:median-quartiles.png)

In [None]:
numerical_variables.quantile([0.25,0.5,0.75])

### We can find outliers using quartiles.

In [None]:
def outlier_treatment(values):
    # Outlier borders according to the 1.5 * IQR rule
    Q1, Q3 = np.percentile(values, [25, 75])
    IQR = Q3 - Q1  # IQR = interquartile range
    lowerrange = Q1 - (1.5 * IQR)  # values below this border are outliers
    upperrange = Q3 + (1.5 * IQR)  # values above this border are outliers
    return lowerrange, upperrange

print("Outlier Borders for Age Column:")
print(outlier_treatment(numerical_variables.age))  # for age
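
Once the borders are known, the outliers themselves can be selected with a boolean mask. The sketch below uses a made-up array of ages with one planted extreme value; it is not the `age` column of the data set.

```python
import numpy as np

# Hypothetical ages; 120 is planted as an obvious outlier
ages = np.array([35, 41, 45, 50, 52, 54, 55, 58, 61, 63, 120])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values that fall outside the borders
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, outliers)
```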


### We can use the box plot to visualize outliers.

In [None]:
numerical_variables.plot(kind='box', subplots=True, layout=(4,4), sharex=False,sharey=False ,figsize =(15,15))
plt.show()

<a id = "24"></a><br>
# 6. Statistical Thinking Model: MOONEY
It is the guide that models the path from data literacy to data analytics.

### Stages of the Mooney model
1. Definition of data
1. Organizing data
1. Representation of data
1. Analyzing and interpreting data

<a id = "25"></a><br>
## 1. Definition of data 
### What are the variables measured in the dataset?

In [None]:
data.columns

### What are the types of variables ?

In [None]:
data.head()


* Age: Numerical variable
* Sex: Categorical variable
* Cp: Categorical Variable
* Trestbps: Numerical Variable
* Chol: Numerical variable
* Fbs: Categorical variable
* Restecg: Categorical variable
* Thalach: Numerical variable
* Exang: Categorical variable
* Oldpeak: Numerical variable
* Slope: Categorical variable
* Ca: Categorical variable
* Thal: Categorical variable
* Target: Categorical variable

### On what scale are the variables you specified measured?

* Age: Ratio
* Sex: Nominal
* Cp: Nominal (if it indicates the severity of the pain --> Ordinal)
* Trestbps: Ratio
* Chol: Ratio
* Fbs: Nominal
* Restecg: Nominal
* Thalach: Ratio
* Exang: Nominal
* Oldpeak: Ratio
* Slope: Nominal
* Ca: Nominal
* Thal: Nominal
* Target: Nominal

<a id = "26"></a><br>
## 2. Organizing data 
Organizing the data for better understanding.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.hist(data.age, bins=[0, 10,20, 30,40,50,60,70,80,90,100])

### We can say that most of the people in our sample are between 50 and 60 years old.

In [None]:
data[["sex","target"]].groupby(["sex"], as_index = False).mean().sort_values(by="target",ascending = False)


**75 percent of the women in the data set are sick, and 44 percent of the men are sick.**
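
Why does `.mean()` give a percentage here? Because `target` is coded as 0 and 1, the mean of a group equals the share of ones in it. The sketch below uses a tiny made-up table with the same 0/1 coding; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical mini data set using the same 0/1 coding as the notebook
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0],
})

# Mean of a 0/1 column = proportion of ones in each group
rates = toy.groupby("sex")["target"].mean()
print(rates)
```

Here 3 of 4 rows with `sex == 0` have `target == 1`, so the group mean is 0.75.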

In [None]:
data[["cp","target"]].groupby(["cp"], as_index = False).mean().sort_values(by="target",ascending = False)


**Those with chest pain type 1 have a higher rate of getting sick**

<a id = "27"></a><br>
## 3. Representation of Data
To understand the data better, we need to choose the right type of chart.

**An example of wrong chart selection**

In [None]:
fig, ax = plt.subplots()

ax.scatter(data.age.index,data.age)
plt.ylabel("Age")
plt.xlabel("index")
plt.show()

**Correct representation**

In [None]:
import seaborn as sns
from matplotlib import pyplot
a4_dims = (18, 8)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.countplot(x='age',hue='target',data=data, linewidth=1,ax=ax)


<a id = "28"></a><br>
## 4. Analyzing and interpreting data

In [None]:
sns.set_style("whitegrid")
a4_dims = (18, 8)
fig, ax = pyplot.subplots(figsize=a4_dims)
sns.countplot(x='sex',hue='target',data=data, linewidth=1,ax=ax)
ax.set(xlabel='1=male - 0=female', ylabel='Count')
print(data.sex.value_counts())

### Although there are fewer women in the dataset, the disease is more common among them when the rates are examined.


In [None]:
print(data.cp.value_counts())
pd.crosstab(data.cp,data.target).plot(kind="barh",figsize=(15,7),color=['#0000ff','#000000'])
plt.title('Chest pain type and target distribution')
plt.xlabel('Frequency')
plt.ylabel('Chest pain type')
plt.show()


### The rate of disease is higher in patients with chest pain type 2.

<a id = "29"></a><br>
# 7. Population and Sample

### A *population* is the entire group that you want to draw conclusions about.

### A *sample* is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.

![sample-size-definition.png](attachment:sample-size-definition.png)


### Let's write an example for the variable age of the data set.

In [None]:
np.random.seed(10)
sample=np.random.choice(a=data.age, size=100) #we draw 100 random values from the age column (with replacement by default)
print("Sample mean : {0}".format(sample.mean())) #sample mean
print("Population Mean : {:.2f}".format(data.age.mean())) #population mean

### The sample mean is close to the population mean, so the sample we selected represents the population well.
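
How close the sample mean gets depends on the sample size. The sketch below compares a small and a larger sample drawn without replacement, so that each value is picked at most once. The "population" here is randomly generated and hypothetical; it is not the real `age` column.

```python
import numpy as np

rng = np.random.default_rng(10)
# Hypothetical population of 303 ages between 29 and 77 (not the real data)
population = rng.integers(29, 78, size=303)

# replace=False draws each value at most once
small = rng.choice(population, size=10, replace=False)
large = rng.choice(population, size=100, replace=False)

print("Population mean: {:.2f}".format(population.mean()))
print("Mean of 10 values: {:.2f}".format(small.mean()))
print("Mean of 100 values: {:.2f}".format(large.mean()))
```

Re-running with different seeds usually shows the mean of the larger sample varying less around the population mean.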

<a id = "30"></a><br>
# 8. Confidence interval:
A confidence interval is a range between two numbers that is likely to contain the true value of a population parameter. Let's try it for cholesterol.


In [None]:
import statsmodels.stats.api as sms
sms.DescrStatsW(data.chol).tconfint_mean()

### With 95 percent confidence, the population mean of cholesterol lies between these two numbers.
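
To see where such an interval comes from, here is a hand-computed sketch of a 95 percent t-interval for a mean: mean ± t × standard error. The ten values are made up and are not from the `chol` column, and the use of `scipy.stats.t.ppf` for the critical value is a choice made for this illustration.

```python
import math
import statistics
from scipy import stats

# Hypothetical cholesterol values (not from heart.csv)
values = [212, 250, 204, 236, 354, 192, 294, 263, 199, 168]

n = len(values)
mean = statistics.mean(values)
sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)          # two-sided 95 % critical value

lower, upper = mean - t_crit * sem, mean + t_crit * sem
print("95% confidence interval for the mean: ({:.2f}, {:.2f})".format(lower, upper))
```

The interval describes the uncertainty about the mean, not the range in which individual values fall.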

# The first part of the introduction to statistics is over and see you in the second part.

![04981577746d5bfc4c1cd95bd863ebc1.jpg](attachment:04981577746d5bfc4c1cd95bd863ebc1.jpg)