# Descriptive Statistics

Descriptive statistics is the practice of organizing, summarizing, and presenting data in an informative way.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('Titanic.csv')

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


- **Measures of Central Tendency**

The **mean** is the arithmetic average of a collection of values: their sum divided by their count. It applies to both continuous and discrete numeric data.

In [4]:
df['Age'].mean()

29.69911764705882

The **median** is the middle value of the data once it is sorted. When the number of values is even, it is the average of the two middle values.

In [5]:
df['Age'].median()

28.0
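
As a minimal sketch of the definition above, the median can be found by sorting the values and picking the middle one (or averaging the two middle ones). The helper `median_manual` and the sample ages are hypothetical and only serve as an illustration; they are not taken from `Titanic.csv`.

```python
import statistics

def median_manual(values):
    """Median by sorting; hypothetical helper for illustration only."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

ages = [22, 38, 26, 35, 35]        # illustrative sample, odd count
ages_with_outlier = ages + [80]    # even count, with one extreme value

print(median_manual(ages))                 # 35
print(median_manual(ages_with_outlier))    # 35.0
print(statistics.mean(ages))               # 31.2
print(statistics.mean(ages_with_outlier))  # larger than 31.2
```

In this sample the single extreme value pulls the mean upward while the median stays at 35, which is why the median is often reported for skewed data.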

The **mode** is the most frequently occurring value in the data set.

In [6]:
df['Age'].mode().values[0]

24.0
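
A data set can have more than one mode. `Series.mode()` returns every value tied for the highest count, sorted in ascending order, so `.values[0]` keeps only the smallest of them. The sketch below uses made-up values (not the Titanic data) to show the difference.

```python
import pandas as pd

# Illustrative values: 21 and 24 both occur twice.
sample = pd.Series([24, 21, 30, 24, 21, 19])

modes = sample.mode()        # all values tied for the highest count, sorted
print(modes.tolist())        # [21, 24]
print(modes.values[0])       # 21, only the smallest of the tied modes
```

When ties matter for the analysis, it may be safer to inspect the whole result of `mode()` instead of only its first element.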

In [None]:
# Task

In [7]:
# Titanic data
print('Mean: ', df['Fare'].mean())
print('Median: ', df['Fare'].median())
print('Mode: ', df['Fare'].mode().values[0])

Mean:  32.2042079685746
Median:  14.4542
Mode:  8.05


In [8]:
df2 = pd.read_csv('kelompok2.csv')
df2.head()

Unnamed: 0,Nama,Usia,Domisili,Hobi,Cita-Cita,Hewan Peliharaan
0,Jaza,22,Jogja,Masak,Pengusaha,10
1,Irvan,20,Palembang,Main gitar,Pemusik,7
2,Risma,21,Semarang,Membaca,Data Scientist,8
3,Rizal,21,Bandung,Nonton,Freelancer,4
4,Shafira,21,Surabaya,Travelling,Data Analyst,2


In [9]:
# manual
mean = (22+20+21+21+21)/5
mean

21.0

In [10]:
# pandas
print('Mean: ', df2['Usia'].mean())
print('Median: ', df2['Usia'].median())
print('Mode: ', df2['Usia'].mode().values[0])

Mean:  21.0
Median:  21.0
Mode:  21


- **Measures of Spread / Dispersion**

The **range** is the difference between the largest and the smallest value in the data.

In [11]:
df_age_notnull = df['Age'][df['Age'].notnull()]

In [12]:
df_age_notnull

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
885    39.0
886    27.0
887    19.0
889    26.0
890    32.0
Name: Age, Length: 714, dtype: float64

In [None]:
# numpy

In [13]:
range_age = np.ptp(df_age_notnull)
print(range_age)

79.58


In [None]:
# manual

In [14]:
df_age_notnull.max()

80.0

In [15]:
df_age_notnull.min()

0.42

In [16]:
df_age_notnull.max() - df_age_notnull.min()

79.58
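
The reason for filtering out missing ages before calling `np.ptp` can be sketched as follows: NaN propagates through NumPy's max/min reductions, whereas pandas' `max()` and `min()` skip NaN by default. The short series below is a made-up sample, not the full `Age` column.

```python
import numpy as np
import pandas as pd

# Illustrative sample with one missing value.
ages = pd.Series([22.0, np.nan, 38.0, 0.42, 80.0])

range_with_nan = np.ptp(ages.to_numpy())       # NaN propagates through np.ptp
range_clean = np.ptp(ages.dropna().to_numpy()) # range after dropping NaN
range_pandas = ages.max() - ages.min()         # pandas skips NaN by default

print(range_with_nan)   # nan
print(range_clean)      # 79.58
print(range_pandas)     # 79.58
```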

In [None]:
# Task

In [17]:
df_fare_notnull = df['Fare'][df['Fare'].notnull()]
df_fare_notnull

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
886    13.0000
887    30.0000
888    23.4500
889    30.0000
890     7.7500
Name: Fare, Length: 891, dtype: float64

In [18]:
#numpy
range_fare = np.ptp(df_fare_notnull)
print(range_fare)

512.3292


In [19]:
#manual
print('Max: ', df_fare_notnull.max())
print('Min: ', df_fare_notnull.min())
print('Range: ', df_fare_notnull.max()-df_fare_notnull.min())

Max:  512.3292
Min:  0.0
Range:  512.3292


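**Variance** measures how far a set of numbers is spread out from its mean: it is the average of the squared deviations from the mean. `statistics.variance` computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n.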

In [20]:
import statistics

In [21]:
variance_age = statistics.variance(df_age_notnull)
print(variance_age)

211.01912474630805
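
To make the calculation concrete, here is a minimal sketch that computes the variance by hand and compares it with the `statistics` module. It assumes the five `Usia` values shown by `df2.head()` above; the variable names are only illustrative.

```python
import statistics

ages = [22, 20, 21, 21, 21]   # the five Usia values shown by df2.head()

mean = sum(ages) / len(ages)
squared_deviations = [(x - mean) ** 2 for x in ages]

# Sample variance divides by n - 1; population variance divides by n.
sample_variance = sum(squared_deviations) / (len(ages) - 1)
population_variance = sum(squared_deviations) / len(ages)

print(sample_variance, statistics.variance(ages))       # 0.5 0.5
print(population_variance, statistics.pvariance(ages))  # 0.4 0.4
```

The same distinction shows up across libraries: `np.var` defaults to `ddof=0` (population variance), while pandas' `Series.var` defaults to `ddof=1` (sample variance), so the two can disagree on the same data.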


In [None]:
# Task

In [22]:
df2_usia_notnull = df2['Usia'][df2['Usia'].notnull()]

In [23]:
variance_usia = statistics.variance(df2_usia_notnull)
print(variance_usia)

0.5


The **standard deviation** is the square root of the variance. It measures the amount of variation or dispersion in a set of values, expressed in the same units as the data.
- A low standard deviation indicates that the values tend to be close to the mean of the set.
- A high standard deviation indicates that the values are spread out over a wider range.

In [24]:
import statistics

In [25]:
stdev_age = statistics.stdev(df_age_notnull)
print(stdev_age)

14.526497332334042
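
The relationship between variance and standard deviation, and the low-versus-high contrast described above, can be sketched with two small samples that share the same mean. The `tight` list reuses the five `Usia` values shown by `df2.head()`; the `wide` list is made up for illustration.

```python
import math
import statistics

tight = [22, 20, 21, 21, 21]   # values close to the mean of 21
wide = [1, 11, 21, 31, 41]     # same mean of 21, but far more spread out

# The standard deviation is the square root of the (sample) variance.
stdev_from_variance = math.sqrt(statistics.variance(tight))

stdev_tight = statistics.stdev(tight)
stdev_wide = statistics.stdev(wide)

print(stdev_from_variance)   # about 0.707
print(stdev_tight)           # about 0.707
print(stdev_wide)            # about 15.81
```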


In [None]:
# Task

In [26]:
df2_usia_notnull = df2['Usia'][df2['Usia'].notnull()]

In [27]:
stdev_usia = statistics.stdev(df2_usia_notnull)
print(stdev_usia)

0.7071067811865476


**Quantiles** are cut points that divide the sorted data into intervals containing equal fractions of the values. Percentiles and quartiles are specific types of quantiles. The quartiles are:
- Q0 is the smallest value in the data
- Q1 is the value separating the first quarter from the second quarter of the data
- Q2 is the middle value (median), separating the bottom from the top half
- Q3 is the value separating the third quarter from the fourth quarter
- Q4 is the largest value in the data

In [28]:
quantile_age = np.quantile(df_age_notnull, [0,0.25,0.5,0.75,1])
print(quantile_age)

[ 0.42  20.125 28.    38.    80.   ]
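
As a rough sketch of how these cut points are obtained, the example below computes the quartiles of a small made-up array. With its default settings, `np.quantile` interpolates linearly between neighboring sorted values when a cut point falls between them. The difference Q3 - Q1 (the interquartile range) is included as a common way to summarize the spread of the middle half of the data.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])   # illustrative values

q0, q1, q2, q3, q4 = np.quantile(data, [0, 0.25, 0.5, 0.75, 1])
iqr = q3 - q1   # interquartile range: spread of the middle 50%

print(q0, q4)       # 1.0 8.0 (minimum and maximum)
print(q1, q2, q3)   # 2.75 4.5 6.25
print(iqr)          # 3.5

# Percentiles are the same cut points expressed on a 0-100 scale.
same_cut_point = np.isclose(np.percentile(data, 25), q1)
print(same_cut_point)
```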


In [None]:
# Task

In [29]:
#titanic
quantile_fare = np.quantile(df_fare_notnull, [0,0.25,0.5,0.75,1])
print(quantile_fare)

[  0.       7.9104  14.4542  31.     512.3292]


In [30]:
#group2
quantile_usia = np.quantile(df2_usia_notnull, [0,0.25,0.5,0.75,1])
print(quantile_usia)

[20. 21. 21. 21. 22.]
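
Most of the measures covered in this notebook can also be obtained in one call with `Series.describe()`, which reports the count, mean, sample standard deviation, minimum, quartiles, and maximum. The sketch below assumes the five `Usia` values shown by `df2.head()` rather than reading `kelompok2.csv`.

```python
import pandas as pd

# The five Usia values shown by df2.head(); used here instead of the CSV file.
ages = pd.Series([22, 20, 21, 21, 21], name="Usia")

summary = ages.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```

The `std` entry uses the sample formula (dividing by n - 1), so it matches `statistics.stdev` rather than a population standard deviation.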
