## Descriptive Statistics - Measures of Central Tendency and Variability
Perform the following operations on any open-source dataset (e.g., data.csv):
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for 
the numeric variables of a dataset (age, income, etc.), grouped by one of its qualitative 
(categorical) variables. For example, if the categorical variable is age group and the 
quantitative variable is income, provide summary statistics of income for each age group. 
Create a list that contains a numeric value for each response to the categorical 
variable. 

2. Write a Python program to display basic statistical details such as percentiles, mean, 
and standard deviation for the species ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’ 
in the iris.csv dataset.
Provide the code with outputs and explain each step.


## Part 1 Loan Dataset

In [46]:
import pandas as pd
import numpy as np
import statistics as st

df = pd.read_csv("loan_data.csv")
print(df.shape)
print(df.head())
print(df.info())

(367, 12)
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001015   Male     Yes          0      Graduate            No   
1  LP001022   Male     Yes          1      Graduate            No   
2  LP001031   Male     Yes          2      Graduate            No   
3  LP001035   Male     Yes          2      Graduate            No   
4  LP001051   Male      No          0  Not Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5720                  0       110.0             360.0   
1             3076               1500       126.0             360.0   
2             5000               1800       208.0             360.0   
3             2340               2546       100.0             360.0   
4             3276                  0        78.0             360.0   

   Credit_History Property_Area  
0             1.0         Urban  
1             1.0         Urban  
2             1.0         Urban  
3           

In [47]:
df.mean()

ApplicantIncome      4805.599455
CoapplicantIncome    1569.577657
LoanAmount            136.132597
Loan_Amount_Term      342.537396
Credit_History          0.825444
dtype: float64

In [49]:
df.iloc[:, [6]].mean()

ApplicantIncome    4805.599455
dtype: float64

In [53]:
# axis=1 -> computed across the columns of each row (one value per row); axis=0 -> computed down each column (one value per column)
df.mean(axis=1)[0:10]

0    1238.2
1    1012.6
2    1473.8
3    1336.5
4     743.0
5    1220.0
6     529.2
7     877.6
8    2830.8
9    1056.8
dtype: float64
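
The `axis` argument used above can be checked on a small made-up frame. This is only a minimal sketch: the frame `toy` and its columns `a` and `b` are hypothetical and are not part of the loan data.

```python
import pandas as pd

# Hypothetical toy frame: two numeric columns, three rows.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 collapses the rows: one mean per column.
col_means = toy.mean(axis=0)

# axis=1 collapses the columns: one mean per row.
row_means = toy.mean(axis=1)

print(col_means)
print(row_means)
```

In the loan data, the row-wise mean averages incomes, loan amount, loan term and credit history together (for row 0: (5720 + 0 + 110 + 360 + 1) / 5 = 1238.2), so it mixes units and mainly serves to demonstrate `axis=1`.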

In [55]:
df.median(axis=1)[0:100]

0      110.0
1      360.0
2      360.0
3     1350.0
4       78.0
       ...  
95     110.0
96     360.0
97     360.0
98      61.0
99     274.0
Length: 100, dtype: float64

In [56]:
df.median()

ApplicantIncome      3786.0
CoapplicantIncome    1025.0
LoanAmount            125.0
Loan_Amount_Term      360.0
Credit_History          1.0
dtype: float64

In [57]:
df.mode()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,3500.0,0.0,150.0,360.0,1.0,Urban
1,LP001022,,,,,,5000.0,,,,,
2,LP001031,,,,,,,,,,,
3,LP001035,,,,,,,,,,,
4,LP001051,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
362,LP002971,,,,,,,,,,,
363,LP002975,,,,,,,,,,,
364,LP002980,,,,,,,,,,,
365,LP002986,,,,,,,,,,,
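
The shape of the `df.mode()` result above follows from how ties are handled: each column lists all of its most frequent values, and shorter columns are padded with missing values. The sketch below reproduces that behaviour on a made-up frame; `toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame illustrating three tie situations.
toy = pd.DataFrame({
    "id": ["x1", "x2", "x3", "x4"],                 # all unique: every value is a mode
    "income": [3500, 5000, 3500, 5000],             # two values tie for most frequent
    "area": ["Urban", "Urban", "Rural", "Urban"],   # a single mode
})

# One row per tied mode; columns with fewer modes are padded with NaN.
modes = toy.mode()
print(modes)
```

This is why `Loan_ID`, whose values are all distinct, fills every row of the loan output, while `ApplicantIncome` shows two modes (3500 and 5000) and the remaining columns show one.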


In [58]:
df.std()

ApplicantIncome      4910.685399
CoapplicantIncome    2334.232099
LoanAmount             61.366652
Loan_Amount_Term       65.156643
Credit_History          0.380150
dtype: float64

In [59]:
df.var()

ApplicantIncome      2.411483e+07
CoapplicantIncome    5.448639e+06
LoanAmount           3.765866e+03
Loan_Amount_Term     4.245388e+03
Credit_History       1.445139e-01
dtype: float64
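
The variance and standard deviation outputs above are related by a square root (for example, the square root of 2.411483e+07 is about 4910.69). A minimal sketch on made-up numbers, also showing that pandas uses the sample formula (`ddof=1`) by default while NumPy defaults to the population formula (`ddof=0`); the series `s` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = s.var()           # pandas default ddof=1: 32 / (8 - 1)
sample_std = s.std()           # square root of the sample variance
pop_std = s.std(ddof=0)        # population formula: sqrt(32 / 8)
numpy_std = np.std(s.to_numpy())  # NumPy default is ddof=0

print(sample_var, sample_std, pop_std, numpy_std)
```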

**Measuring the Interquartile Range (IQR)**

In [60]:
from scipy.stats import iqr
iqr(df['ApplicantIncome'])

2196.0
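
The IQR is the difference between the third and first quartiles, Q3 - Q1. The sketch below computes it by hand with `numpy.percentile` on made-up values; the array `values` is hypothetical, and the 1.5 × IQR fences are a common rule of thumb added here for illustration, not something the notebook itself uses.

```python
import numpy as np

# Hypothetical sample; with 9 sorted values the quartiles fall exactly on data points.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1, q3 = np.percentile(values, [25, 75])  # default linear interpolation
iqr_manual = q3 - q1

# Rule-of-thumb fences often used to flag potential outliers.
lower_fence = q1 - 1.5 * iqr_manual
upper_fence = q3 + 1.5 * iqr_manual

print(q1, q3, iqr_manual, lower_fence, upper_fence)
```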

In [61]:
print(df.skew())

ApplicantIncome      8.441375
CoapplicantIncome    4.257357
LoanAmount           2.223512
Loan_Amount_Term    -2.679318
Credit_History      -1.722379
dtype: float64


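#### The skewness values can be interpreted as follows:

Highly skewed distribution: the skewness value is less than −1 or greater than +1.

Moderately skewed distribution: the skewness value is between −1 and −½ or between +½ and +1.

Approximately symmetric distribution: the skewness value is between −½ and +½.

By this rule, every numeric column in the output above is highly skewed: ApplicantIncome, CoapplicantIncome and LoanAmount to the right (positive values), Loan_Amount_Term and Credit_History to the left (negative values).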
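
The thresholds above can be wrapped in a small helper. This is a sketch: the function name `classify_skewness` is hypothetical, and the treatment of values that fall exactly on ±½ or ±1 (counted here as moderately skewed) is an assumption, since the rule does not say which side the boundaries belong to.

```python
def classify_skewness(value):
    """Label a skewness value using the ±0.5 and ±1 thresholds."""
    magnitude = abs(value)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:  # assumption: boundary values count as moderate
        return "moderately skewed"
    return "approximately symmetric"


# Skewness values copied from the df.skew() output above.
loan_skew = {
    "ApplicantIncome": 8.441375,
    "CoapplicantIncome": 4.257357,
    "LoanAmount": 2.223512,
    "Loan_Amount_Term": -2.679318,
    "Credit_History": -1.722379,
}

labels = {name: classify_skewness(value) for name, value in loan_skew.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```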

In [62]:
df.groupby('ApplicantIncome').count()

Unnamed: 0_level_0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
ApplicantIncome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,2,2,2,2,2,2,2,2,2,2,2
570,1,0,1,1,1,1,1,1,1,1,1
724,1,1,1,1,1,1,1,1,1,1,1
1141,1,1,1,1,1,1,1,1,1,1,1
1173,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...
18840,1,1,1,1,1,1,1,1,1,1,1
24797,1,1,1,1,1,1,1,1,1,1,1
29167,1,0,1,1,1,1,1,1,1,1,1
32000,1,1,1,1,1,1,1,1,1,0,1


In [None]:
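# The loan data has no 'Age' column, so group by a categorical column that exists, e.g. 'Education'
df.groupby('Education')['ApplicantIncome'].agg(['mean', 'median', 'min', 'max', 'std'])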
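
Task 1 asks for summary statistics of a numeric variable grouped by a categorical one, plus a list holding a numeric value for each categorical response. One possible way to do both is sketched below on a made-up frame: the rows in `toy` are hypothetical values, and `pd.factorize` is only one of several ways to assign numeric codes.

```python
import pandas as pd

# Hypothetical rows; the column names mirror the loan data.
toy = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate", "Not Graduate"],
    "ApplicantIncome": [5720, 3276, 3076, 5000, 2340],
})

# Numeric value for each response: categories are numbered in order of first appearance.
codes, categories = pd.factorize(toy["Education"])
education_codes = codes.tolist()
print(list(categories), education_codes)

# Summary statistics of the numeric variable for each category.
summary = toy.groupby("Education")["ApplicantIncome"].agg(
    ["mean", "median", "min", "max", "std"]
)
print(summary)
```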

## Part 2 Iris Dataset

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn import datasets

**Loading the dataset from sklearn**

In [2]:
iris = datasets.load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [8]:
# Create a DataFrame from the iris measurements
df = pd.DataFrame(iris['data'])
df.head()

Unnamed: 0,0,1,2,3
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [9]:
# Add the target (numeric species code) as a new column
df[4] = iris['target']
df.head()

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [11]:
#Adding column names
df.rename(columns={0:'SepalLengthCm', 1:'SepalWidthCm', 2:'PetalLengthCm', 3:'PetalWidthCm', 4:'Species'}, inplace=True)
df.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
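
The `Species` column built this way holds the codes 0, 1 and 2, which are labels rather than quantities. A minimal sketch of turning such codes into named categories follows; the short `target` list is hypothetical, and `target_names` is hard-coded with the names that scikit-learn exposes as `iris['target_names']` so that the example runs without scikit-learn.

```python
import pandas as pd

# Hypothetical handful of species codes, in the same 0/1/2 encoding as iris['target'].
target = [0, 0, 1, 2, 1]

# Species names in code order, as provided by scikit-learn's iris['target_names'].
target_names = ["setosa", "versicolor", "virginica"]

# Build a categorical column whose values are names instead of integers.
species = pd.Categorical.from_codes(target, categories=target_names)
print(pd.Series(species).value_counts())
```

With a categorical column, numeric summaries such as the mean no longer apply to `Species`, while counts and the mode still do.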


In [12]:
df.describe()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,1.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


In [13]:
df.mean()

SepalLengthCm    5.843333
SepalWidthCm     3.057333
PetalLengthCm    3.758000
PetalWidthCm     1.199333
Species          1.000000
dtype: float64

In [14]:
df.median()

SepalLengthCm    5.80
SepalWidthCm     3.00
PetalLengthCm    4.35
PetalWidthCm     1.30
Species          1.00
dtype: float64

In [15]:
df.Species.mode()

0    0
1    1
2    2
Name: Species, dtype: int32

In [16]:
df.groupby(['Species']).count()

Unnamed: 0_level_0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,50,50,50,50
1,50,50,50,50
2,50,50,50,50


In [20]:
df['Species'].value_counts()

0    50
1    50
2    50
Name: Species, dtype: int64

### Standard Deviation

In [17]:
df.SepalLengthCm.std()

0.8280661279778629

In [18]:
df.SepalWidthCm.std()

0.435866284936698

In [30]:
data = pd.read_csv('iris.csv')
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [35]:
data.groupby('Species').var()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,212.5,0.124249,0.14518,0.030106,0.011494
Iris-versicolor,212.5,0.266433,0.098469,0.220816,0.039106
Iris-virginica,212.5,0.404343,0.104004,0.304588,0.075433


In [36]:
data.groupby('Species').mean()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,25.5,5.006,3.418,1.464,0.244
Iris-versicolor,75.5,5.936,2.77,4.26,1.326
Iris-virginica,125.5,6.588,2.974,5.552,2.026


In [37]:
data.groupby('Species').median()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,25.5,5.0,3.4,1.5,0.2
Iris-versicolor,75.5,5.9,2.8,4.35,1.3
Iris-virginica,125.5,6.5,3.0,5.55,2.0


In [39]:
data.groupby('Species').std()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,14.57738,0.35249,0.381024,0.173511,0.10721
Iris-versicolor,14.57738,0.516171,0.313798,0.469911,0.197753
Iris-virginica,14.57738,0.63588,0.322497,0.551895,0.27465


In [40]:
data.groupby('Species').min()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,1,4.3,2.3,1.0,0.1
Iris-versicolor,51,4.9,2.0,3.0,1.0
Iris-virginica,101,4.9,2.2,4.5,1.4


In [41]:
data.groupby('Species').max()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,50,5.8,4.4,1.9,0.6
Iris-versicolor,100,7.0,3.4,5.1,1.8
Iris-virginica,150,7.9,3.8,6.9,2.5


In [42]:
data.groupby('Species').quantile()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,25.5,5.0,3.4,1.5,0.2
Iris-versicolor,75.5,5.9,2.8,4.35,1.3
Iris-virginica,125.5,6.5,3.0,5.55,2.0
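
Called without arguments, `quantile()` uses `q=0.5`, which is why the table above matches the grouped median. To report several percentiles per group, a list of quantiles can be passed instead. The sketch below uses a made-up frame: the group labels `A` and `B` and the measurements are hypothetical.

```python
import pandas as pd

# Hypothetical measurements for two groups of five observations each.
toy = pd.DataFrame({
    "Species": ["A"] * 5 + ["B"] * 5,
    "PetalLengthCm": [1.0, 1.2, 1.4, 1.6, 1.8, 4.0, 4.5, 5.0, 5.5, 6.0],
})

# 25th, 50th and 75th percentiles per group; the result is indexed by (group, quantile).
quartiles = toy.groupby("Species")["PetalLengthCm"].quantile([0.25, 0.5, 0.75])

# unstack() turns the quantile level into columns for a more readable table.
print(quartiles.unstack())
```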


In [43]:
data.groupby("Species").count()

Unnamed: 0_level_0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Iris-setosa,50,50,50,50,50
Iris-versicolor,50,50,50,50,50
Iris-virginica,50,50,50,50,50


In [44]:
data['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
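
The separate grouped calls above (mean, median, std, min, max, quantile, count) can also be obtained in one step with `describe()`. The sketch below is self-contained rather than reading iris.csv: `toy` is a hypothetical subset whose sepal lengths are copied from the first four and last four rows shown in the `data` preview.

```python
import pandas as pd

# Sepal lengths copied from the rows visible in the `data` preview above.
toy = pd.DataFrame({
    "Species": ["Iris-setosa"] * 4 + ["Iris-virginica"] * 4,
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 6.7, 6.3, 6.5, 6.2],
})

# count, mean, std, min, quartiles and max for each species in one table.
per_species = toy.groupby("Species")["SepalLengthCm"].describe()
print(per_species)
```

On the full dataset the same call would be `data.groupby('Species').describe()`, ideally after dropping the `Id` column, whose statistics are not meaningful.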