Title: Descriptive Statistics - Measures of Central Tendency and Variability

Perform the following operations on any open-source dataset (e.g., data.csv):
1. Provide summary statistics (mean, median, minimum, maximum, standard 
deviation) for a dataset (age, income, etc.) with numeric variables grouped by 
one of the qualitative (categorical) variables. For example, if your categorical 
variable is age group and your quantitative variable is income, then provide 
summary statistics of income grouped by age group. Create a list that 
contains a numeric value for each response to the categorical variable. 
2. Write a Python program to display some basic statistical details such as percentiles, 
mean, standard deviation, etc. of the species ‘Iris-setosa’, ‘Iris-versicolor’, 
and ‘Iris-virginica’ of the iris.csv dataset. 
Provide the code with outputs and explain everything that you do in this step.

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Load dataset
df = pd.read_csv("nba1.csv")
df.head()

,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [3]:
# Shape = (number of rows, number of columns)
df.shape

(458, 9)

In [4]:
# Inspect the last rows; row 457 turns out to be entirely empty
df.tail()

,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [7]:
# Dropping the row with index 457 because all of its values are NULL
df.drop(457,axis=0,inplace=True)
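
Dropping by a hard-coded index works here because the empty row was spotted in the `df.tail()` output. A minimal sketch of an alternative that does not depend on knowing the index, shown on a made-up toy frame (the frame and its values are illustrative, not taken from nba1.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the last row is entirely empty.
toy = pd.DataFrame({
    "Name": ["A", "B", np.nan],
    "Age": [25.0, 30.0, np.nan],
    "Salary": [100.0, np.nan, np.nan],
})

# how="all" drops only rows in which every column is missing,
# so row 1 (only Salary missing) is kept.
cleaned = toy.dropna(how="all")
print(cleaned.shape)
```

Rows with only some values missing survive this step and can be handled separately.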

In [8]:
# Count of NULL values
df.isnull().sum()

Name         0
Team         0
Number       0
Position     0
Age          0
Height       0
Weight       0
College     84
Salary      11
dtype: int64

In [9]:
# Handling NULL values: fill College with a constant and Salary with the column mean
df["College"]=df["College"].fillna("No College")
df["Salary"]=df["Salary"].fillna(df["Salary"].mean())
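
The comment above mentions the mean and the median as possible fill values. For a skewed column such as salaries, the two can differ a lot. A small sketch on made-up numbers (the series below is illustrative and is not taken from the dataset):

```python
import numpy as np
import pandas as pd

# One large value pulls the mean upwards; the median is barely affected.
salary = pd.Series([100.0, 120.0, 150.0, 2000.0, np.nan])

mean_filled = salary.fillna(salary.mean())      # missing value becomes 592.5
median_filled = salary.fillna(salary.median())  # missing value becomes 135.0

print(mean_filled.iloc[-1], median_filled.iloc[-1])
```

Which of the two is more appropriate depends on the data; the notebook uses the mean.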

In [10]:
df.isnull().sum()

Name        0
Team        0
Number      0
Position    0
Age         0
Height      0
Weight      0
College     0
Salary      0
dtype: int64

In [11]:
# Datatypes
df.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

In [12]:
# Converting datatypes to suitable format
df['Number'] = df['Number'].astype(int)
df['Age'] = df['Age'].astype(int)
df['Weight'] = df['Weight'].astype(int)
df['Salary'] = df['Salary'].astype(int)

In [14]:
# Decimal scaling: express Salary in units of 10,000
df['Salary']/=10000
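
The cell above divides Salary by a fixed 10,000. In the usual textbook formulation of decimal scaling, each value is divided by 10^j, where j is the smallest integer that brings the largest absolute value below 1. A minimal sketch of that variant, using the four non-missing salaries shown in the first `df.head()` output as sample input; the helper name `decimal_scale` is hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def decimal_scale(values: pd.Series):
    """Hypothetical helper: divide by 10**j so that max(|v|) falls below 1."""
    max_abs = values.abs().max()
    # Smallest integer j with max_abs / 10**j < 1 (assumes max_abs > 0).
    j = int(np.floor(np.log10(max_abs))) + 1
    return values / 10 ** j, j

sample = pd.Series([7730337.0, 6796117.0, 1148640.0, 5000000.0])
scaled, j = decimal_scale(sample)
print(j)
print(scaled.max())
```

With a fixed divisor, as in the notebook, the values stay readable as "tens of thousands" but are not confined to the range below 1.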

In [16]:
# Creating age-group categories from Age (right=False makes each bin include its left edge)
bins = [15, 20, 25, 30, 35, 40, 45]
labels = ['15-19', '20-24', '25-29', '30-34', '35-39', '40-44']
df['Age groups'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
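
With `right=False`, each bin is closed on the left and open on the right, so the label '20-24' covers ages 20 up to but not including 25. A short sketch with the same bins and labels, applied to a few hand-picked boundary ages (the ages are illustrative):

```python
import pandas as pd

bins = [15, 20, 25, 30, 35, 40, 45]
labels = ['15-19', '20-24', '25-29', '30-34', '35-39', '40-44']

# Boundary cases: 20 and 25 open a new group; 45 lies outside the last bin.
ages = pd.Series([19, 20, 24, 25, 44, 45])
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())
```

An age of 45 or more would become NaN under these bins, so the bin edges need to cover the full age range of the data.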

In [17]:
df.head()

,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Age groups
0,Avery Bradley,Boston Celtics,0,PG,25,6-2,180,Texas,773.0337,25-29
1,Jae Crowder,Boston Celtics,99,SF,25,6-6,235,Marquette,679.6117,25-29
2,John Holland,Boston Celtics,30,SG,27,6-5,205,Boston University,484.2684,25-29
3,R.J. Hunter,Boston Celtics,28,SG,22,6-5,185,Georgia State,114.864,20-24
4,Jonas Jerebko,Boston Celtics,8,PF,29,6-10,231,No College,500.0,25-29


In [18]:
# Count according to age group
df['Age groups'].value_counts()

Age groups
25-29    181
20-24    152
30-34     90
35-39     29
40-44      3
15-19      2
Name: count, dtype: int64

In [20]:
# Median according to age group
df.groupby('Age groups', observed=True)['Salary'].median()

Age groups
15-19    193.04400
20-24    168.60395
25-29    357.89470
30-34    520.95845
35-39    285.49400
40-44    525.00000
Name: Salary, dtype: float64

In [23]:
# Group by 'Age groups' and get summary stats of 'Salary' (or any numeric variable)
grouped_stats = df.groupby('Age groups', observed=True)['Salary'].agg(['mean', 'median', 'min', 'max', 'std'])
grouped_stats

Age groups,mean,median,min,max,std
15-19,193.044,193.044,173.304,212.784,27.916576
20-24,277.264238,168.60395,3.0888,1640.7501,318.428298
25-29,567.930039,357.8947,5.5722,2015.8622,548.695552
30-34,705.955852,520.95845,20.06,2297.05,611.251875
35-39,380.999052,285.494,22.2888,2500.0,459.029781
40-44,466.691667,525.0,25.075,850.0,415.542068
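
The `std` column above is the sample standard deviation: pandas divides by n - 1 (`ddof=1`) by default, whereas NumPy's `np.std` divides by n (`ddof=0`) unless told otherwise. A quick check using the two salaries of the '15-19' group as they appear in the grouped list output of this notebook:

```python
import numpy as np
import pandas as pd

# The two scaled salaries in the 15-19 age group.
salaries = pd.Series([212.784, 173.304])

sample_std = salaries.std()                    # pandas default: ddof=1
population_std = np.std(salaries.to_numpy())   # NumPy default: ddof=0

print(round(sample_std, 6))
print(round(population_std, 6))
```

The first value reproduces the 27.916576 reported in the table; the second is smaller because it divides by n instead of n - 1.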


In [25]:
# Listing grouped data: one list of salaries per age group
salary_grouped_list = df.groupby('Age groups', observed=True)['Salary'].apply(list)
salary_grouped_list

Age groups
15-19                                   [212.784, 173.304]
20-24    [114.864, 117.096, 182.436, 343.104, 256.926, ...
25-29    [773.0337, 679.6117, 484.2684, 500.0, 1200.0, ...
30-34    [630.0, 800.0, 163.5476, 2287.5, 740.2812, 94....
35-39    [484.2684, 290.0, 567.5, 337.6, 94.7726, 2500....
40-44                               [525.0, 25.075, 850.0]
Name: Salary, dtype: object
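
If a plain Python mapping is more convenient than a Series of lists, the same result can be converted with `to_dict()`. A minimal sketch on a toy frame that contains only the '15-19' and '40-44' salaries shown above (the frame itself is constructed for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age groups": ["15-19", "15-19", "40-44", "40-44", "40-44"],
    "Salary": [212.784, 173.304, 525.0, 25.075, 850.0],
})

# Collect the salaries of each group into a list, then into a dict
# keyed by the group label.
salary_by_group = toy.groupby("Age groups")["Salary"].apply(list).to_dict()
print(salary_by_group)
```

Each key is a category label and each value is the list of numeric responses for that category.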

In [27]:
# Listing grouped data: (group label, Series of salaries) pairs from the GroupBy object
list(df.groupby('Age groups', observed=True)['Salary'])

[('15-19',
  122    212.784
  226    173.304
  Name: Salary, dtype: float64),
 ('20-24',
  3       114.864
  6       117.096
  8       182.436
  9       343.104
  10      256.926
           ...   
  446    1200.000
  447     117.588
  449     134.844
  452     223.980
  454      90.000
  Name: Salary, Length: 152, dtype: float64),
 ('25-29',
  0       773.0337
  1       679.6117
  2       484.2684
  4       500.0000
  5      1200.0000
           ...    
  450     205.0000
  451      98.1348
  453     243.3333
  455     290.0000
  456      94.7276
  Name: Salary, Length: 181, dtype: float64),
 ('30-34',
  19      630.0000
  30      800.0000
  31      163.5476
  33     2287.5000
  34      740.2812
           ...    
  405    1210.0000
  415     313.5000
  421     334.4000
  434     501.6000
  440     285.4940
  Name: Salary, Length: 90, dtype: float64),
 ('35-39',
  46      484.2684
  72      290.0000
  93      567.5000
  101     337.6000
  102      94.7726
  109    2500.0000
  119      

In [28]:
# Load the Iris dataset
iris = pd.read_csv("Iris.csv")
iris.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [29]:
iris.shape

(150, 6)

In [30]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [31]:
# Get the unique species in the dataset
iris['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [35]:
# Basic statistics by Species
iris.groupby('Species').describe()

,Id,Id,Id,Id,Id,Id,Id,Id,SepalLengthCm,SepalLengthCm,...,PetalLengthCm,PetalLengthCm,PetalWidthCm,PetalWidthCm,PetalWidthCm,PetalWidthCm,PetalWidthCm,PetalWidthCm,PetalWidthCm,PetalWidthCm
Species,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Iris-setosa,50.0,25.5,14.57738,1.0,13.25,25.5,37.75,50.0,50.0,5.006,...,1.575,1.9,50.0,0.244,0.10721,0.1,0.2,0.2,0.3,0.6
Iris-versicolor,50.0,75.5,14.57738,51.0,63.25,75.5,87.75,100.0,50.0,5.936,...,4.6,5.1,50.0,1.326,0.197753,1.0,1.2,1.3,1.5,1.8
Iris-virginica,50.0,125.5,14.57738,101.0,113.25,125.5,137.75,150.0,50.0,6.588,...,5.875,6.9,50.0,2.026,0.27465,1.4,1.8,2.0,2.3,2.5
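
`describe()` reports the 25th, 50th and 75th percentiles, and it also summarises the `Id` column, which is only a row identifier. A sketch of computing percentiles per species for a single measurement with `Id` dropped first; the toy frame reuses the five setosa sepal lengths from `iris.head()`, while the three versicolor values are made up for illustration:

```python
import pandas as pd

toy_iris = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7, 8],
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0, 5.5, 6.0, 6.5],
    "Species": ["Iris-setosa"] * 5 + ["Iris-versicolor"] * 3,
})

# Quantiles per species; unstack() turns the quantile level into columns.
quartiles = (
    toy_iris.drop(columns="Id")
    .groupby("Species")["SepalLengthCm"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Other percentiles can be requested by changing the list passed to `quantile`, for example `[0.1, 0.9]`.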


In [37]:
# Function to display basic statistics for each species
def species_statistics(species_name):
    print(f"\nStatistics for {species_name}:\n")
    species_data = iris[iris['Species'] == species_name]
    print(species_data.describe())

# Call the function for each species
species_statistics('Iris-setosa')
species_statistics('Iris-versicolor')
species_statistics('Iris-virginica')


Statistics for Iris-setosa:

             Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  50.00000       50.00000     50.000000      50.000000      50.00000
mean   25.50000        5.00600      3.418000       1.464000       0.24400
std    14.57738        0.35249      0.381024       0.173511       0.10721
min     1.00000        4.30000      2.300000       1.000000       0.10000
25%    13.25000        4.80000      3.125000       1.400000       0.20000
50%    25.50000        5.00000      3.400000       1.500000       0.20000
75%    37.75000        5.20000      3.675000       1.575000       0.30000
max    50.00000        5.80000      4.400000       1.900000       0.60000

Statistics for Iris-versicolor:

              Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count   50.00000      50.000000     50.000000      50.000000     50.000000
mean    75.50000       5.936000      2.770000       4.260000      1.326000
std     14.57738       0.516171      0.313798

In [40]:
# Summary statistics of SepalLengthCm for Iris-versicolor only
iris[iris["Species"] == "Iris-versicolor"]["SepalLengthCm"].describe()

count    50.000000
mean      5.936000
std       0.516171
min       4.900000
25%       5.600000
50%       5.900000
75%       6.300000
max       7.000000
Name: SepalLengthCm, dtype: float64

In [43]:
# Summary statistics for Iris-setosa, transposed so that each row is one variable
iris[iris["Species"] == "Iris-setosa"].describe().transpose()

,count,mean,std,min,25%,50%,75%,max
Id,50.0,25.5,14.57738,1.0,13.25,25.5,37.75,50.0
SepalLengthCm,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8
SepalWidthCm,50.0,3.418,0.381024,2.3,3.125,3.4,3.675,4.4
PetalLengthCm,50.0,1.464,0.173511,1.0,1.4,1.5,1.575,1.9
PetalWidthCm,50.0,0.244,0.10721,0.1,0.2,0.2,0.3,0.6


In [44]:
# Summary statistics for Iris-virginica
iris[iris["Species"] == "Iris-virginica"].describe()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,50.0,50.0,50.0,50.0,50.0
mean,125.5,6.588,2.974,5.552,2.026
std,14.57738,0.63588,0.322497,0.551895,0.27465
min,101.0,4.9,2.2,4.5,1.4
25%,113.25,6.225,2.8,5.1,1.8
50%,125.5,6.5,3.0,5.55,2.0
75%,137.75,6.9,3.175,5.875,2.3
max,150.0,7.9,3.8,6.9,2.5
