# **🔀 Day 5 – Sorting & Basic Statistics in Pandas 📈🐼**

#### **Goal:** Learn how to rearrange data efficiently using sorting and master the use of fundamental statistical methods to quickly summarize and describe a dataset.

#### **Topics To Cover:** Sorting by Values (sort_values) and Index (sort_index), Ranking (rank), Descriptive Statistics (describe), Aggregation (sum, mean, max), and Frequency Counts (value_counts).

----

## **Introduction to Data Ordering and Summarization 🧠**
Data analysis often begins with two fundamental steps: ordering the data for quick inspection and summarizing it to get a feel for its distribution.

* **Sorting** means the rearrangement of rows or columns in a DataFrame based on a specific, determined order.

* **Basic Statistics** refers to the functions that calculate simple quantitative measures (like mean, median, count) to describe the main features of a collection of data.

### **Why are Sorting and Statistics Important? 💡**
* **Readability & Insight:** Sorting makes data organized and intuitive. For example, sorting products by price immediately reveals the cheapest and most expensive items.

* **Error Detection:** Sorting can bring extreme values (outliers) to the top or bottom, making them easy to spot.

* **Data Quality Check:** Descriptive statistics help verify if the data falls within expected ranges (e.g., checking if an 'Age' column has a logical minimum and maximum).

* **Foundational Knowledge:** These methods are the building blocks for all subsequent advanced data analysis and visualization.

### **Statistics and Its Need in Pandas:**

**Statistics:** It is the discipline of collecting, analyzing, and interpreting data to reveal patterns. For an AI/ML student using pandas, it's the crucial first step, acting like a doctor's initial examination of a patient. Just as the doctor needs to understand a patient's vital signs to make a proper diagnosis, a student uses statistics to understand their data's health—identifying central tendencies, spread, and anomalies—before building a model. This pre-analysis ensures the data is clean and suitable for training, directly impacting the model's performance and reliability.

***

### Let's Begin 💻

In [1]:
# import libraries
import pandas as pd
import numpy as np

# load the data (read_csv already returns a DataFrame, so no extra wrapping is needed)
df = pd.read_csv(r'../data/BMW sales data (2010-2024) (1).csv')
df

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,5 Series,2016,Asia,Red,Petrol,Manual,3.5,151748,98740,8300,High
1,i8,2013,North America,Red,Hybrid,Automatic,1.6,121671,79219,3428,Low
2,5 Series,2022,North America,Blue,Petrol,Automatic,4.5,10991,113265,6994,Low
3,X3,2024,Middle East,Blue,Petrol,Automatic,1.7,27255,60971,4047,Low
4,7 Series,2020,South America,Black,Diesel,Manual,2.1,122131,49898,3080,Low
...,...,...,...,...,...,...,...,...,...,...,...
49995,i3,2014,Asia,Red,Hybrid,Manual,4.6,151030,42932,8182,High
49996,i3,2023,Middle East,Silver,Electric,Manual,4.2,147396,48714,9816,High
49997,5 Series,2010,Middle East,Red,Petrol,Automatic,4.5,174939,46126,8280,High
49998,i3,2020,Asia,White,Electric,Automatic,3.8,3379,58566,9486,High


***

## **5.1: Sorting DataFrames ↕️**

Pandas provides two main methods for ordering your data: 
* sorting by the values in a column.

* sorting by the index labels.

### **5.1.1. Sorting by Values:**
**`df.sort_values()`:** This is the most common sorting technique. It reorders the DataFrame rows based on the contents of one or more specified columns.

#### Key Parameters for `sort_values()`:
<table border="1">
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Description</th>
      <th>Default Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>by</code></td>
      <td>The column name(s) to sort by (required). Can be a single string or a list of strings for multi-level sorting.</td>
      <td>(None / Required)</td>
    </tr>
    <tr>
      <td><code>axis</code></td>
      <td><code>0</code> sorts the rows by the values in the column(s) named in <code>by</code>; <code>1</code> sorts the columns by the values in the row label(s) named in <code>by</code>.</td>
      <td><code>0</code> (sorts rows)</td>
    </tr>
    <tr>
      <td><code>ascending</code></td>
      <td>Boolean (<code>True</code> for A–Z / 0–9; <code>False</code> for Z–A / 9–0). Can be a list of Booleans for multi-level sorting.</td>
      <td><code>True</code></td>
    </tr>
    <tr>
      <td><code>inplace</code></td>
      <td>If <code>True</code>, the DataFrame is modified in place and <code>None</code> is returned.</td>
      <td><code>False</code></td>
    </tr>
    <tr>
      <td><code>na_position</code></td>
      <td>Controls where NaN values are placed: <code>'first'</code> or <code>'last'</code>.</td>
      <td><code>'last'</code></td>
    </tr>
  </tbody>
</table>
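
The dataset used in this notebook has no missing values, so `na_position` never comes into play in the cells that follow. Here is a minimal sketch on a small, made-up DataFrame (the `cars` frame and its values are assumptions for illustration, not part of the BMW data) showing how `ascending` and `na_position` work together:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one price is deliberately missing.
cars = pd.DataFrame({
    "Model": ["X1", "i3", "M5", "X5"],
    "Price_USD": [45000.0, np.nan, 90000.0, 60000.0],
})

# Defaults: ascending order, NaN placed last.
cheapest_first = cars.sort_values(by="Price_USD")

# Descending order, with NaN moved to the top instead.
priciest_first = cars.sort_values(
    by="Price_USD", ascending=False, na_position="first"
)

print(cheapest_first["Model"].tolist())  # ['X1', 'X5', 'M5', 'i3']
print(priciest_first["Model"].tolist())  # ['i3', 'M5', 'X5', 'X1']
```

Note that `na_position` is independent of `ascending`: the missing row goes where you ask, whichever direction the other values are sorted in.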


In [2]:
# Sort BMW car sales by Price_USD column (ascending)
df_sort_asc = df.sort_values(by='Price_USD')
df_sort_asc

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
3762,i8,2013,Asia,Grey,Electric,Automatic,2.5,165642,30000,7782,High
19031,X5,2022,Asia,Blue,Diesel,Manual,4.8,100254,30001,5035,Low
26405,5 Series,2015,Africa,White,Hybrid,Manual,2.4,190634,30002,5469,Low
13439,M5,2018,Asia,Black,Hybrid,Manual,4.9,27843,30005,6316,Low
3264,3 Series,2021,Middle East,Red,Hybrid,Manual,4.1,54636,30008,5554,Low
...,...,...,...,...,...,...,...,...,...,...,...
6271,3 Series,2019,Middle East,White,Hybrid,Manual,4.0,12264,119994,3259,Low
154,X1,2016,Africa,Grey,Petrol,Manual,4.1,172950,119996,9620,High
38158,X6,2019,Asia,Red,Electric,Automatic,3.3,142419,119997,4575,Low
6862,i8,2024,Africa,Silver,Diesel,Automatic,4.1,163849,119997,9250,High


In [3]:
# Sort BMW car sales by Price_USD column (descending)
df_sort_dsc = df.sort_values(by='Price_USD', ascending=False)
df_sort_dsc

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
26071,i8,2010,Middle East,Silver,Electric,Manual,4.2,115320,119998,5842,Low
38158,X6,2019,Asia,Red,Electric,Automatic,3.3,142419,119997,4575,Low
6862,i8,2024,Africa,Silver,Diesel,Automatic,4.1,163849,119997,9250,High
154,X1,2016,Africa,Grey,Petrol,Manual,4.1,172950,119996,9620,High
6271,3 Series,2019,Middle East,White,Hybrid,Manual,4.0,12264,119994,3259,Low
...,...,...,...,...,...,...,...,...,...,...,...
3264,3 Series,2021,Middle East,Red,Hybrid,Manual,4.1,54636,30008,5554,Low
13439,M5,2018,Asia,Black,Hybrid,Manual,4.9,27843,30005,6316,Low
26405,5 Series,2015,Africa,White,Hybrid,Manual,2.4,190634,30002,5469,Low
19031,X5,2022,Asia,Blue,Diesel,Manual,4.8,100254,30001,5035,Low


In [4]:
# Sorting by multiple columns
# sort by Sales_Volume first, then by Price_USD to break ties
df_multi_sort = df.sort_values(by=['Sales_Volume', 'Price_USD']) # ascending (overwritten by the next line)
df_multi_sort = df.sort_values(by=['Sales_Volume', 'Price_USD'], ascending=False) # descending on both columns; this is the result displayed
df_multi_sort


Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
47953,5 Series,2024,South America,White,Electric,Manual,2.5,52062,88572,9999,High
27113,X3,2011,Africa,Silver,Electric,Manual,2.0,123193,79124,9999,High
13521,7 Series,2024,Europe,Silver,Diesel,Automatic,3.7,50463,71167,9999,High
39518,3 Series,2017,North America,Black,Electric,Automatic,2.7,35357,64641,9999,High
38529,M3,2022,Europe,Red,Petrol,Manual,4.8,43988,38126,9999,High
...,...,...,...,...,...,...,...,...,...,...,...
26335,5 Series,2024,North America,Blue,Hybrid,Automatic,3.3,80752,46642,101,Low
38420,i3,2021,North America,Silver,Diesel,Automatic,4.1,33516,78548,100,Low
40439,M5,2019,Europe,Red,Hybrid,Automatic,2.0,19028,60042,100,Low
32233,X1,2010,South America,Grey,Electric,Manual,3.6,88222,47593,100,Low


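**The order of the columns in `by` is crucial. In the code above, the DataFrame is sorted by `Sales_Volume` first; `Price_USD` only decides the order among rows that share the same `Sales_Volume`. You can see this in the output: the rows with a `Sales_Volume` of 9999 appear in descending `Price_USD` order.**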
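
To check the effect of column order on a table small enough to verify by eye, here is a sketch using four made-up rows (the `toy` frame is hypothetical and only borrows the column names from the dataset):

```python
import pandas as pd

# Hypothetical toy data with repeated Sales_Volume values.
toy = pd.DataFrame({
    "Model": ["A", "B", "C", "D"],
    "Sales_Volume": [200, 100, 200, 100],
    "Price_USD": [50000, 40000, 30000, 60000],
})

# Primary key Sales_Volume, ties broken by Price_USD.
by_sales_then_price = toy.sort_values(by=["Sales_Volume", "Price_USD"])

# Primary key Price_USD; Sales_Volume would only matter for equal prices.
by_price_then_sales = toy.sort_values(by=["Price_USD", "Sales_Volume"])

print(by_sales_then_price["Model"].tolist())  # ['B', 'D', 'C', 'A']
print(by_price_then_sales["Model"].tolist())  # ['C', 'B', 'A', 'D']
```

Swapping the two names in `by` changes which column is the primary key, and therefore the whole row order.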

In [5]:
# Sort the DataFrame in place (not recommended: it overwrites df and returns None)
# df.sort_values(by='Sales_Volume', inplace=True)
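
A small sketch of what the in-place form does, using a made-up three-row frame (`toy` is hypothetical). It illustrates one common reason to prefer assigning the sorted copy to a new name: the in-place call returns `None` and the original row order is lost unless you sort by index again.

```python
import pandas as pd

toy = pd.DataFrame({"Sales_Volume": [300, 100, 200]})

# Usual pattern: keep the original, bind the sorted copy to a new name.
sorted_copy = toy.sort_values(by="Sales_Volume")

# inplace=True mutates `toy` itself and returns None.
result = toy.sort_values(by="Sales_Volume", inplace=True)

print(result)                        # None
print(toy["Sales_Volume"].tolist())  # [100, 200, 300]
print(toy.index.tolist())            # [1, 2, 0] (labels travel with their rows)
```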

In [6]:
df_multi_sort = df.sort_values(by=['Price_USD', 'Sales_Volume']) # first sort by price then by sales
df_multi_sort

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
3762,i8,2013,Asia,Grey,Electric,Automatic,2.5,165642,30000,7782,High
19031,X5,2022,Asia,Blue,Diesel,Manual,4.8,100254,30001,5035,Low
26405,5 Series,2015,Africa,White,Hybrid,Manual,2.4,190634,30002,5469,Low
13439,M5,2018,Asia,Black,Hybrid,Manual,4.9,27843,30005,6316,Low
3264,3 Series,2021,Middle East,Red,Hybrid,Manual,4.1,54636,30008,5554,Low
...,...,...,...,...,...,...,...,...,...,...,...
6271,3 Series,2019,Middle East,White,Hybrid,Manual,4.0,12264,119994,3259,Low
154,X1,2016,Africa,Grey,Petrol,Manual,4.1,172950,119996,9620,High
38158,X6,2019,Asia,Red,Electric,Automatic,3.3,142419,119997,4575,Low
6862,i8,2024,Africa,Silver,Diesel,Automatic,4.1,163849,119997,9250,High


In [7]:
df_multi_sort = df.sort_values(by=['Year', 'Model'])
df_multi_sort

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
146,3 Series,2010,Asia,Black,Petrol,Manual,2.1,107572,86660,8650,High
663,3 Series,2010,Africa,Red,Diesel,Automatic,4.1,123015,105114,8248,High
739,3 Series,2010,North America,Black,Diesel,Automatic,1.8,172040,45092,6364,Low
929,3 Series,2010,North America,Silver,Hybrid,Manual,2.5,136265,98452,7885,High
957,3 Series,2010,Middle East,Blue,Diesel,Automatic,1.7,149295,109920,6039,Low
...,...,...,...,...,...,...,...,...,...,...,...
47941,i8,2024,Europe,Silver,Electric,Automatic,3.7,177606,56433,6640,Low
48303,i8,2024,Middle East,White,Diesel,Automatic,3.8,109864,69246,3719,Low
48526,i8,2024,South America,Red,Hybrid,Automatic,2.1,170338,119301,3290,Low
48611,i8,2024,Africa,Silver,Diesel,Automatic,1.9,168709,31598,9537,High


**Sorting with Mixed Orders (Ascending and Descending)**

You can sort in mixed order by passing a list of columns to the `by` parameter and a list of booleans to the `ascending` parameter. Both lists must have the same length: each boolean in `ascending` applies to the column at the same position in `by`.

In [8]:
# Sort ascending by mileage, descending by price and ascending by sales
df_mix_sort = df.sort_values(by=['Mileage_KM', 'Price_USD', 'Sales_Volume'], ascending=[True, False, True])
df_mix_sort

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
5291,i3,2010,Africa,White,Petrol,Manual,2.8,3,93933,5336,Low
7780,5 Series,2018,South America,Blue,Petrol,Automatic,4.5,21,55195,9860,High
23362,7 Series,2015,North America,Silver,Hybrid,Automatic,2.4,23,78427,348,Low
17180,X5,2017,Asia,Silver,Petrol,Automatic,3.4,29,65476,6454,Low
20924,3 Series,2016,Europe,Red,Petrol,Automatic,2.7,36,114661,5912,Low
...,...,...,...,...,...,...,...,...,...,...,...
22741,X5,2021,Asia,Blue,Hybrid,Automatic,3.8,199979,106046,5681,Low
48848,i3,2020,Middle East,Black,Petrol,Automatic,1.6,199987,99357,1721,Low
3681,7 Series,2024,Middle East,White,Electric,Manual,3.0,199991,35172,2178,Low
17872,X5,2016,Africa,Grey,Diesel,Manual,3.8,199995,111226,2362,Low


### **5.1.2. Sorting by Index:**
**`df.sort_index()`:** This method reorders the rows of the DataFrame based on the values in the row labels (the index).

#### Key Parameters for `sort_index()`:
<table border="1">
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Description</th>
      <th>Default Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>axis</code></td>
      <td><code>0</code> for sorting the row index, <code>1</code> for sorting the column index (column names).</td>
      <td><code>0</code> (sorts rows)</td>
    </tr>
    <tr>
      <td><code>level</code></td>
      <td>If working with a MultiIndex, specifies the level(s) to sort.</td>
      <td><code>None</code></td>
    </tr>
    <tr>
      <td><code>ascending</code></td>
      <td><code>True</code> for ascending order of index labels; <code>False</code> for descending.</td>
      <td><code>True</code></td>
    </tr>
    <tr>
      <td><code>inplace</code></td>
      <td>If <code>True</code>, the DataFrame is modified in place and <code>None</code> is returned.</td>
      <td><code>False</code></td>
    </tr>
  </tbody>
</table>
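
Because the BMW data is already in index order when loaded, `sort_index()` appears to do nothing in the cells that follow. The sketch below uses a made-up frame with a shuffled index (the `toy` frame is an assumption for illustration) so the effect is visible:

```python
import pandas as pd

# Hypothetical toy data with row labels deliberately out of order.
toy = pd.DataFrame(
    {"Price_USD": [30000, 90000, 60000], "Model": ["X1", "M5", "X5"]},
    index=[2, 0, 1],
)

rows_sorted = toy.sort_index()               # row labels ascending
rows_desc = toy.sort_index(ascending=False)  # row labels descending
cols_sorted = toy.sort_index(axis=1)         # column names alphabetical

print(rows_sorted.index.tolist())    # [0, 1, 2]
print(rows_desc.index.tolist())      # [2, 1, 0]
print(cols_sorted.columns.tolist())  # ['Model', 'Price_USD']
```

A typical use is restoring the original row order after a `sort_values()` call, since the index labels stay attached to their rows.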


In [9]:
df_row_sort = df.sort_index()
df_row_sort

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,5 Series,2016,Asia,Red,Petrol,Manual,3.5,151748,98740,8300,High
1,i8,2013,North America,Red,Hybrid,Automatic,1.6,121671,79219,3428,Low
2,5 Series,2022,North America,Blue,Petrol,Automatic,4.5,10991,113265,6994,Low
3,X3,2024,Middle East,Blue,Petrol,Automatic,1.7,27255,60971,4047,Low
4,7 Series,2020,South America,Black,Diesel,Manual,2.1,122131,49898,3080,Low
...,...,...,...,...,...,...,...,...,...,...,...
49995,i3,2014,Asia,Red,Hybrid,Manual,4.6,151030,42932,8182,High
49996,i3,2023,Middle East,Silver,Electric,Manual,4.2,147396,48714,9816,High
49997,5 Series,2010,Middle East,Red,Petrol,Automatic,4.5,174939,46126,8280,High
49998,i3,2020,Asia,White,Electric,Automatic,3.8,3379,58566,9486,High


In [10]:
df_col_sort = df.sort_index(axis=1)
df_col_sort

Unnamed: 0,Color,Engine_Size_L,Fuel_Type,Mileage_KM,Model,Price_USD,Region,Sales_Classification,Sales_Volume,Transmission,Year
0,Red,3.5,Petrol,151748,5 Series,98740,Asia,High,8300,Manual,2016
1,Red,1.6,Hybrid,121671,i8,79219,North America,Low,3428,Automatic,2013
2,Blue,4.5,Petrol,10991,5 Series,113265,North America,Low,6994,Automatic,2022
3,Blue,1.7,Petrol,27255,X3,60971,Middle East,Low,4047,Automatic,2024
4,Black,2.1,Diesel,122131,7 Series,49898,South America,Low,3080,Manual,2020
...,...,...,...,...,...,...,...,...,...,...,...
49995,Red,4.6,Hybrid,151030,i3,42932,Asia,High,8182,Manual,2014
49996,Silver,4.2,Electric,147396,i3,48714,Middle East,High,9816,Manual,2023
49997,Red,4.5,Petrol,174939,5 Series,46126,Middle East,High,8280,Automatic,2010
49998,White,3.8,Electric,3379,i3,58566,Asia,High,9486,Automatic,2020


***

## **5.2. Ranking 📊**

Ranking is the process of assigning a numerical rank to each row in a DataFrame based on the values in a specific column. It determines the relative position of each item within a sorted order.

### **Why Rank a DataFrame? Real-World Use Cases**

In your AI/ML career, ranking a DataFrame is a critical step for several common use cases, primarily for **feature engineering** and **data analysis**. It helps you transform raw data into a format that a machine learning model can better understand or use to uncover insights.

#### Common Use Cases

1.  **Recommender Systems**: Ranking is fundamental in building a recommendation engine. For example, you can rank products by their sales, ratings, or popularity to recommend the top-selling or most highly-rated items to users. 

2.  **Performance Analysis**: You can rank different models, features, or experiments based on their performance metrics (e.g., accuracy, F1-score) to easily identify the best performers. This is crucial during the model selection and hyperparameter tuning phases.

3.  **Anomaly and Outlier Detection**: By ranking data points, you can quickly identify extreme values. For instance, ranking customers by spending and looking at the top 1% can reveal potential high-value customers or fraudulent transactions.

4.  **Feature Engineering**: Ranking can be used to create new features that represent the relative position of a data point, rather than its absolute value. A model might find a product's sales rank more useful than its raw sales number, as the rank provides a sense of its performance relative to other products in the dataset. This can help normalize the data and improve model performance.

5.  **Competitive Analysis**: If you have a dataset of competitors, you can rank them by various metrics like market share, customer satisfaction scores, or growth rate to understand your company's position in the market. 

### **5.2.1 rank() method**
`rank()`: This method is used to compute the numerical data ranks for a DataFrame or Series. It's a useful tool for understanding the relative standing of each data point. By default, it assigns a rank of 1 to the smallest value and increases the rank for larger values.

**Key parameters:** `axis`, `method`, `numeric_only`, `na_option`, `ascending` and `pct`

*Using* `axis` *parameter*
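
With 50,000 rows it is hard to tell from the output what `axis` changes. A minimal sketch on a made-up 2×3 frame (`toy` and its column names are hypothetical) makes the difference checkable by hand:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 5], "b": [3, 2], "c": [2, 9]})

# axis=0 (default): each value is ranked against the other values in its column.
within_columns = toy.rank()

# axis=1: each value is ranked against the other values in its row.
within_rows = toy.rank(axis=1)

print(within_columns.values.tolist())  # [[1.0, 2.0, 1.0], [2.0, 1.0, 2.0]]
print(within_rows.values.tolist())     # [[1.0, 3.0, 2.0], [2.0, 1.0, 3.0]]
```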

In [11]:
# axis=0 (default) ranks the values within each column; axis=1 ranks the values within each row
df_rank_axis0 = df.rank()
df_rank_axis0
df_rank_axis1 = df.rank(axis=1, numeric_only=True) # numeric_only=True is needed here because each row mixes numeric and text (object) columns
df_rank_axis1

Unnamed: 0,Year,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume
0,2.0,1.0,5.0,4.0,3.0
1,2.0,1.0,5.0,4.0,3.0
2,2.0,1.0,4.0,5.0,3.0
3,2.0,1.0,4.0,5.0,3.0
4,2.0,1.0,5.0,4.0,3.0
...,...,...,...,...,...
49995,2.0,1.0,5.0,4.0,3.0
49996,2.0,1.0,5.0,4.0,3.0
49997,2.0,1.0,5.0,4.0,3.0
49998,2.0,1.0,3.0,5.0,4.0


*Using* `method` *parameter*
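
The five tie-breaking methods are easiest to compare side by side on a tiny Series with one tie. This is a sketch on made-up values (the `scores` Series is hypothetical), not output from the BMW data:

```python
import pandas as pd

# The two 20s are tied for positions 2 and 3.
scores = pd.Series([10, 20, 20, 30])

methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: scores.rank(method=m) for m in methods})
print(comparison)

# Ranks given to [10, 20, 20, 30] by each method:
#   average -> 1.0, 2.5, 2.5, 4.0   (mean of positions 2 and 3)
#   min     -> 1.0, 2.0, 2.0, 4.0   (lowest position in the tie)
#   max     -> 1.0, 3.0, 3.0, 4.0   (highest position in the tie)
#   first   -> 1.0, 2.0, 3.0, 4.0   (ties broken by order of appearance)
#   dense   -> 1.0, 2.0, 2.0, 3.0   (like min, but no gap after the tie)
```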

In [12]:

# method: Handles how ties (equal values) are ranked.
# 'average' (default): Assigns the average rank to each tied value.
df_avg = df.rank()
# df_avg = df['Price_USD'].rank()
df_avg

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,6891.5,21657.0,12480.5,29115.0,43725.5,37423.5,28588.5,37766.0,38129.0,41404.5,7623.5
1,47697.5,11603.5,37582.0,29115.0,31092.5,12423.5,1426.5,30245.0,27322.0,16622.0,32623.5
2,6891.5,41615.5,37582.0,12404.5,43725.5,12423.5,42988.0,2733.0,46203.0,34730.5,32623.5
3,29563.0,48287.0,29228.0,12404.5,43725.5,12423.5,2851.5,6901.0,17258.0,19734.5,32623.5
4,11520.5,34902.0,45875.0,4137.0,6132.0,37423.5,8594.5,30352.0,11044.0,14856.0,32623.5
...,...,...,...,...,...,...,...,...,...,...,...
49995,43085.5,14941.5,12480.5,29115.0,31092.5,37423.5,44428.0,37585.0,7160.5,40799.5,7623.5
49996,43085.5,44964.0,29228.0,37521.5,18499.0,37423.5,38640.5,36716.0,10369.0,49080.5,7623.5
49997,6891.5,1665.5,29228.0,29115.0,43725.5,12423.5,42988.0,43600.5,8882.0,41301.5,7623.5
49998,43085.5,34902.0,12480.5,45848.5,18499.0,12423.5,32887.0,858.0,15863.5,47345.5,7623.5


In [13]:

# 'min': Assigns the minimum rank in the group of tied values.
df_min = df.rank(method='min')
df_min

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,4596.0,19975.0,8254.0,24884.0,37451.0,24847.0,27864.0,37766.0,38128.0,41400.0,1.0
1,45395.0,9941.0,33415.0,24884.0,24735.0,1.0,711.0,30245.0,27321.0,16621.0,15247.0
2,4596.0,39877.0,33415.0,8274.0,37451.0,1.0,42251.0,2733.0,46202.0,34729.0,15247.0
3,27315.0,46574.0,25042.0,8274.0,37451.0,1.0,2143.0,6901.0,17258.0,19732.0,15247.0
4,9188.0,33300.0,41750.0,1.0,1.0,24847.0,7871.0,30352.0,11044.0,14853.0,15247.0
...,...,...,...,...,...,...,...,...,...,...,...
49995,40777.0,13267.0,8254.0,24884.0,24735.0,24847.0,43726.0,37585.0,7160.0,40796.0,1.0
49996,40777.0,43355.0,25042.0,33347.0,12264.0,24847.0,37907.0,36716.0,10369.0,49079.0,1.0
49997,4596.0,1.0,25042.0,24884.0,37451.0,1.0,42251.0,43600.0,8881.0,41299.0,1.0
49998,40777.0,33300.0,8254.0,41697.0,12264.0,1.0,32132.0,858.0,15863.0,47344.0,1.0


In [14]:

# 'max': Assigns the maximum rank in the group of tied values.
df_max = df.rank(method='max')
df_max

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,9187.0,23339.0,16707.0,33346.0,50000.0,50000.0,29313.0,37766.0,38130.0,41409.0,15246.0
1,50000.0,13266.0,41749.0,33346.0,37450.0,24846.0,2142.0,30245.0,27323.0,16623.0,50000.0
2,9187.0,43354.0,41749.0,16535.0,50000.0,24846.0,43725.0,2733.0,46204.0,34732.0,50000.0
3,31811.0,50000.0,33414.0,16535.0,50000.0,24846.0,3560.0,6901.0,17258.0,19737.0,50000.0
4,13853.0,36504.0,50000.0,8273.0,12263.0,50000.0,9318.0,30352.0,11044.0,14859.0,50000.0
...,...,...,...,...,...,...,...,...,...,...,...
49995,45394.0,16616.0,16707.0,33346.0,37450.0,50000.0,45130.0,37585.0,7161.0,40803.0,15246.0
49996,45394.0,46573.0,33414.0,41696.0,24734.0,50000.0,39374.0,36716.0,10369.0,49082.0,15246.0
49997,9187.0,3330.0,33414.0,33346.0,50000.0,24846.0,43725.0,43601.0,8883.0,41304.0,15246.0
49998,45394.0,36504.0,16707.0,50000.0,24734.0,24846.0,33642.0,858.0,15864.0,47347.0,15246.0


In [15]:

# 'first': Tied values are ranked in the order they appear in the original DataFrame.
df_first = df.rank(method='first')
df_first

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,4596.0,19975.0,8254.0,24884.0,37451.0,24847.0,27864.0,37766.0,38128.0,41400.0,1.0
1,45395.0,9941.0,33415.0,24885.0,24735.0,1.0,711.0,30245.0,27321.0,16621.0,15247.0
2,4597.0,39877.0,33416.0,8274.0,37452.0,2.0,42251.0,2733.0,46202.0,34729.0,15248.0
3,27315.0,46574.0,25042.0,8275.0,37453.0,3.0,2143.0,6901.0,17258.0,19732.0,15249.0
4,9188.0,33300.0,41750.0,1.0,1.0,24848.0,7871.0,30352.0,11044.0,14853.0,15250.0
...,...,...,...,...,...,...,...,...,...,...,...
49995,45392.0,16616.0,16706.0,33345.0,37450.0,49998.0,45130.0,37585.0,7161.0,40803.0,15243.0
49996,45393.0,46573.0,33413.0,41696.0,24733.0,49999.0,39374.0,36716.0,10369.0,49082.0,15244.0
49997,9187.0,3330.0,33414.0,33346.0,50000.0,24845.0,43725.0,43601.0,8883.0,41304.0,15245.0
49998,45394.0,36503.0,16707.0,50000.0,24734.0,24846.0,33642.0,858.0,15864.0,47347.0,15246.0


In [16]:

# 'dense': Like 'min', but ranks are always consecutive integers without gaps.
df_dense = df.rank(method='dense')
df_dense

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,2.0,7.0,2.0,4.0,4.0,2.0,21.0,33536.0,29229.0,8153.0,1.0
1,11.0,4.0,5.0,4.0,3.0,1.0,2.0,26872.0,20951.0,3308.0,2.0
2,2.0,13.0,5.0,2.0,4.0,1.0,31.0,2422.0,35379.0,6855.0,2.0
3,7.0,15.0,4.0,2.0,4.0,1.0,3.0,6113.0,13284.0,3923.0,2.0
4,3.0,11.0,6.0,1.0,1.0,2.0,7.0,26966.0,8549.0,2960.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...
49995,10.0,5.0,2.0,4.0,3.0,2.0,32.0,33378.0,5523.0,8036.0,1.0
49996,10.0,14.0,4.0,5.0,2.0,2.0,28.0,32595.0,8010.0,9664.0,1.0
49997,2.0,1.0,4.0,4.0,4.0,1.0,31.0,38697.0,6852.0,8133.0,1.0
49998,10.0,11.0,2.0,6.0,2.0,1.0,24.0,756.0,12249.0,9334.0,1.0


*Using* `ascending` *parameter*
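
A short sketch on three made-up prices (the `prices` Series is hypothetical) showing both directions. Unlike `sort_values()`, `rank()` leaves the rows where they are and only reports each row's position:

```python
import pandas as pd

prices = pd.Series([50000, 30000, 90000], index=["X5", "X1", "M5"])

low_to_high = prices.rank()                 # smallest value gets rank 1
high_to_low = prices.rank(ascending=False)  # largest value gets rank 1

print(low_to_high.tolist())   # [2.0, 1.0, 3.0]
print(high_to_low.tolist())   # [2.0, 3.0, 1.0]
print(prices.index.tolist())  # ['X5', 'X1', 'M5'] (row order unchanged)
```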

In [17]:
# ascending=True is the default; set ascending=False to rank in descending order
df_price = df['Price_USD'].rank()
df_price
# rank Price_USD in descending order
df_price = df['Price_USD'].rank(ascending=False) # the highest price gets rank 1
df_price

0        11872.0
1        22679.0
2         3798.0
3        32743.0
4        38957.0
          ...   
49995    42840.5
49996    39632.0
49997    41119.0
49998    34137.5
49999    23627.0
Name: Price_USD, Length: 50000, dtype: float64

*Using* `numeric_only` *parameter*

In [18]:
# Ranks columns with numeric data only
df_num = df.rank(numeric_only=True)
df_num

Unnamed: 0,Year,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume
0,21657.0,28588.5,37766.0,38129.0,41404.5
1,11603.5,1426.5,30245.0,27322.0,16622.0
2,41615.5,42988.0,2733.0,46203.0,34730.5
3,48287.0,2851.5,6901.0,17258.0,19734.5
4,34902.0,8594.5,30352.0,11044.0,14856.0
...,...,...,...,...,...
49995,14941.5,44428.0,37585.0,7160.5,40799.5
49996,44964.0,38640.5,36716.0,10369.0,49080.5
49997,1665.5,42988.0,43600.5,8882.0,41301.5
49998,34902.0,32887.0,858.0,15863.5,47345.5


*Using* `na_option` *parameter*
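
Since this dataset contains no missing values, the effect of `na_option` cannot be seen in its output. The sketch below uses a made-up Series with one `NaN` (the `values` Series is hypothetical) to show all three options:

```python
import numpy as np
import pandas as pd

values = pd.Series([30.0, np.nan, 10.0, 20.0])

kept = values.rank()                        # na_option='keep' is the default
na_top = values.rank(na_option="top")       # NaN gets the lowest rank (1)
na_bottom = values.rank(na_option="bottom") # NaN gets the highest rank (4)

print(kept.tolist())       # [3.0, nan, 1.0, 2.0]
print(na_top.tolist())     # [4.0, 1.0, 2.0, 3.0]
print(na_bottom.tolist())  # [3.0, 4.0, 1.0, 2.0]
```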

In [19]:
# na_option controls how NaN values are ranked: 'keep' (default) leaves them as NaN, 'top' and 'bottom' give them a rank
# (this dataset has no missing values, so every option produces the same result here)
df_nan_top = df.rank(na_option='top') # NaN values get the lowest ranks (1, 2, ...)
df_nan_top
df_nan_bottom = df.rank(na_option='bottom') # NaN values get the highest ranks
df_nan_bottom

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,6891.5,21657.0,12480.5,29115.0,43725.5,37423.5,28588.5,37766.0,38129.0,41404.5,7623.5
1,47697.5,11603.5,37582.0,29115.0,31092.5,12423.5,1426.5,30245.0,27322.0,16622.0,32623.5
2,6891.5,41615.5,37582.0,12404.5,43725.5,12423.5,42988.0,2733.0,46203.0,34730.5,32623.5
3,29563.0,48287.0,29228.0,12404.5,43725.5,12423.5,2851.5,6901.0,17258.0,19734.5,32623.5
4,11520.5,34902.0,45875.0,4137.0,6132.0,37423.5,8594.5,30352.0,11044.0,14856.0,32623.5
...,...,...,...,...,...,...,...,...,...,...,...
49995,43085.5,14941.5,12480.5,29115.0,31092.5,37423.5,44428.0,37585.0,7160.5,40799.5,7623.5
49996,43085.5,44964.0,29228.0,37521.5,18499.0,37423.5,38640.5,36716.0,10369.0,49080.5,7623.5
49997,6891.5,1665.5,29228.0,29115.0,43725.5,12423.5,42988.0,43600.5,8882.0,41301.5,7623.5
49998,43085.5,34902.0,12480.5,45848.5,18499.0,12423.5,32887.0,858.0,15863.5,47345.5,7623.5


*Using* `pct` *parameter*
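
As a quick check of what `pct=True` computes, here is a sketch on four made-up values (the `values` Series is hypothetical): each rank is divided by the number of values.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])

plain = values.rank()             # 1.0, 2.0, 3.0, 4.0
as_pct = values.rank(pct=True)    # plain rank divided by 4

print(plain.tolist())    # [1.0, 2.0, 3.0, 4.0]
print(as_pct.tolist())   # [0.25, 0.5, 0.75, 1.0]
```

The same relationship holds in the notebook output: a rank of 6891.5 out of 50,000 rows appears as 0.13783.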

In [20]:
# pct=True returns each rank divided by the number of values, so results are greater than 0 and at most 1.
df_pct = df.rank(pct=True)
df_pct

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,0.13783,0.43314,0.24961,0.58230,0.87451,0.74847,0.57177,0.75532,0.76258,0.82809,0.15247
1,0.95395,0.23207,0.75164,0.58230,0.62185,0.24847,0.02853,0.60490,0.54644,0.33244,0.65247
2,0.13783,0.83231,0.75164,0.24809,0.87451,0.24847,0.85976,0.05466,0.92406,0.69461,0.65247
3,0.59126,0.96574,0.58456,0.24809,0.87451,0.24847,0.05703,0.13802,0.34516,0.39469,0.65247
4,0.23041,0.69804,0.91750,0.08274,0.12264,0.74847,0.17189,0.60704,0.22088,0.29712,0.65247
...,...,...,...,...,...,...,...,...,...,...,...
49995,0.86171,0.29883,0.24961,0.58230,0.62185,0.74847,0.88856,0.75170,0.14321,0.81599,0.15247
49996,0.86171,0.89928,0.58456,0.75043,0.36998,0.74847,0.77281,0.73432,0.20738,0.98161,0.15247
49997,0.13783,0.03331,0.58456,0.58230,0.87451,0.24847,0.85976,0.87201,0.17764,0.82603,0.15247
49998,0.86171,0.69804,0.24961,0.91697,0.36998,0.24847,0.65774,0.01716,0.31727,0.94691,0.15247


***

## **5.3. Basic Statistics with Pandas🐼**

Statistics is the discipline of collecting, analyzing, and interpreting data to understand a population or phenomenon. In pandas, we use a range of built-in methods to perform descriptive statistics, which helps summarize and describe the main features of a dataset.

### Descriptive Statistics Methods in Pandas

| Method | Description |
| :--- | :--- |
| `.count()` | Counts non-null values. |
| `.sum()` | Calculates the sum of values. |
| `.mean()` | Computes the average value. |
| `.median()` | Finds the middle value. |
| `.mode()` | Returns the most frequent value(s). |
| `.min()` | Finds the minimum value. |
| `.max()` | Finds the maximum value. |
| `.std()` | Calculates the sample standard deviation (`ddof=1` by default). |
| `.var()` | Computes the sample variance (`ddof=1` by default). |
| `.describe()` | Generates a summary of key statistics. |
| `.quantile()` | Calculates the value at a specific quantile (e.g., 0.25 for the first quartile). |
| `.idxmax()` | Returns the index of the first occurrence of the maximum value. |
| `.idxmin()` | Returns the index of the first occurrence of the minimum value. |
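
Before running these methods on 50,000 rows, it can help to verify them on numbers you can check by hand. The sketch below uses a made-up column with one missing value (the `price` Series is hypothetical); note that every statistic shown skips the `NaN`:

```python
import numpy as np
import pandas as pd

price = pd.Series([30000.0, 50000.0, 50000.0, np.nan, 90000.0])

print(price.count())          # 4 (NaN is not counted)
print(price.sum())            # 220000.0
print(price.mean())           # 55000.0 (220000 / 4)
print(price.median())         # 50000.0
print(price.mode().tolist())  # [50000.0]
print(price.min(), price.max())  # 30000.0 90000.0

# Sample variance: squared deviations from the mean, divided by (count - 1).
manual_var = ((price - price.mean()) ** 2).sum() / (price.count() - 1)
print(abs(price.var() - manual_var) < 1e-6)  # True
```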


In [21]:
df.count()

Model                   50000
Year                    50000
Region                  50000
Color                   50000
Fuel_Type               50000
Transmission            50000
Engine_Size_L           50000
Mileage_KM              50000
Price_USD               50000
Sales_Volume            50000
Sales_Classification    50000
dtype: int64

In [22]:
df.sum(numeric_only=True)

Year             1.008508e+08
Engine_Size_L    1.623590e+05
Mileage_KM       5.015360e+09
Price_USD        3.751730e+09
Sales_Volume     2.533757e+08
dtype: float64

In [23]:
df.mean(numeric_only=True) # numeric_only=True skips the text (object) columns, for which a mean is not defined

Year               2017.01570
Engine_Size_L         3.24718
Mileage_KM       100307.20314
Price_USD         75034.60090
Sales_Volume       5067.51468
dtype: float64

In [24]:
df.median(numeric_only=True)

Year               2017.0
Engine_Size_L         3.2
Mileage_KM       100388.5
Price_USD         75011.5
Sales_Volume       5087.0
dtype: float64

In [25]:
df.mode()

Unnamed: 0,Model,Year,Region,Color,Fuel_Type,Transmission,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume,Sales_Classification
0,7 Series,2022,Asia,Red,Hybrid,Manual,3.8,136842,30948,9502,Low


In [26]:
df.min()

Model                    3 Series
Year                         2010
Region                     Africa
Color                       Black
Fuel_Type                  Diesel
Transmission            Automatic
Engine_Size_L                 1.5
Mileage_KM                      3
Price_USD                   30000
Sales_Volume                  100
Sales_Classification         High
dtype: object

In [27]:
df.max()

Model                              i8
Year                             2024
Region                  South America
Color                           White
Fuel_Type                      Petrol
Transmission                   Manual
Engine_Size_L                     5.0
Mileage_KM                     199996
Price_USD                      119998
Sales_Volume                     9999
Sales_Classification              Low
dtype: object

In [28]:
df.std(numeric_only=True)

Year                 4.324459
Engine_Size_L        1.009078
Mileage_KM       57941.509344
Price_USD        25998.248882
Sales_Volume      2856.767125
dtype: float64

In [29]:
df.var(numeric_only=True)

Year             1.870095e+01
Engine_Size_L    1.018239e+00
Mileage_KM       3.357219e+09
Price_USD        6.759089e+08
Sales_Volume     8.161118e+06
dtype: float64

In [30]:
df.describe()

Unnamed: 0,Year,Engine_Size_L,Mileage_KM,Price_USD,Sales_Volume
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,2017.0157,3.24718,100307.20314,75034.6009,5067.51468
std,4.324459,1.009078,57941.509344,25998.248882,2856.767125
min,2010.0,1.5,3.0,30000.0,100.0
25%,2013.0,2.4,50178.0,52434.75,2588.0
50%,2017.0,3.2,100388.5,75011.5,5087.0
75%,2021.0,4.1,150630.25,97628.25,7537.25
max,2024.0,5.0,199996.0,119998.0,9999.0
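
The output above covers only the numeric columns. Calling `describe()` on a text column gives a different summary. A minimal sketch on a made-up Series (the `models` Series is hypothetical):

```python
import pandas as pd

models = pd.Series(["X1", "X5", "X5", "M5"])

summary = models.describe()
print(summary)

# For non-numeric data the summary contains:
#   count  -> number of non-null values (4)
#   unique -> number of distinct values (3)
#   top    -> most frequent value ('X5')
#   freq   -> how often the top value occurs (2)
```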


In [31]:
df.quantile(numeric_only=True) # q defaults to 0.5, so this returns the median of each numeric column

Year               2017.0
Engine_Size_L         3.2
Mileage_KM       100388.5
Price_USD         75011.5
Sales_Volume       5087.0
Name: 0.5, dtype: float64
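
`quantile()` also accepts other values of `q`, including a list. The sketch below uses five made-up values (the `values` Series is hypothetical) to compute the quartiles and the interquartile range:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

q1 = values.quantile(0.25)
quartiles = values.quantile([0.25, 0.5, 0.75])
iqr = quartiles[0.75] - quartiles[0.25]

print(q1)                  # 20.0
print(quartiles.tolist())  # [20.0, 30.0, 40.0]
print(iqr)                 # 20.0
```

When `q` is a list, the result is indexed by the requested quantiles, which is why `quartiles[0.75]` works.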

In [32]:
df.idxmax()

Model                       1
Year                        3
Region                      4
Color                       6
Fuel_Type                   0
Transmission                0
Engine_Size_L             108
Mileage_KM               1009
Price_USD               26071
Sales_Volume            13521
Sales_Classification        1
dtype: int64

In [33]:
df.idxmin()

Model                      10
Year                       30
Region                     13
Color                       4
Fuel_Type                   4
Transmission                1
Engine_Size_L              51
Mileage_KM               5291
Price_USD                3762
Sales_Volume            10272
Sales_Classification        0
dtype: int64
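
`idxmax()` and `idxmin()` return index labels, not values, so they are usually combined with `.loc` to fetch the matching row. A sketch on a made-up frame with non-default labels (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Model": ["X1", "M5", "X5"], "Price_USD": [30000, 90000, 60000]},
    index=[10, 11, 12],
)

priciest_label = toy["Price_USD"].idxmax()  # label of the highest price
cheapest_label = toy["Price_USD"].idxmin()  # label of the lowest price

# Use the label with .loc to retrieve the full row.
priciest_row = toy.loc[priciest_label]

print(priciest_label, cheapest_label)  # 11 10
print(priciest_row["Model"])           # M5
```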

### Additional Useful Methods

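These methods are most often called on a single column (a Series). `.unique()` is Series-only, while `.value_counts()` and `.nunique()` are also available on DataFrames: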

| Method | Description |
| :--- | :--- |
| `.value_counts()` | Counts how many times each unique value occurs, sorted from most to least frequent. |
| `.unique()` | Returns an array of the unique values, in order of first appearance. |
| `.nunique()` | Counts the number of unique values. |
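
A compact sketch of the three methods on a made-up column (the `regions` Series is hypothetical), including the `normalize=True` option of `value_counts()`, which returns proportions instead of counts:

```python
import pandas as pd

regions = pd.Series(["Asia", "Europe", "Asia", "Africa", "Asia", "Europe"])

counts = regions.value_counts()                # occurrences per value
shares = regions.value_counts(normalize=True)  # fraction of all rows
distinct = list(regions.unique())              # values in order of first appearance
n_distinct = regions.nunique()                 # number of distinct values

print(counts.to_dict())  # {'Asia': 3, 'Europe': 2, 'Africa': 1}
print(shares["Asia"])    # 0.5
print(distinct)          # ['Asia', 'Europe', 'Africa']
print(n_distinct)        # 3
```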

In [34]:
df['Region'].value_counts()

Region
Asia             8454
Middle East      8373
North America    8335
Europe           8334
Africa           8253
South America    8251
Name: count, dtype: int64

In [35]:
df['Model'].unique()

array(['5 Series', 'i8', 'X3', '7 Series', 'M5', '3 Series', 'X1', 'M3',
       'X5', 'i3', 'X6'], dtype=object)

In [36]:
df['Year'].nunique()

15