# Pandas tutorial for Level 4 Data Analysis



## 1. Setup


### Import

Before we can use pandas, we need to install and import it. If you have installed the [Anaconda distribution](https://www.anaconda.com/) on your local machine, or you are using [Google Colab](https://research.google.com/colaboratory), pandas is already available. Otherwise, follow the installation instructions on the [official pandas website](https://pandas.pydata.org/docs/getting_started/install.html).

In [None]:
# Importing libraries
import numpy as np
import pandas as pd

In [None]:
# set how many rows and columns are displayed
pd.set_option('display.min_rows', 10)  # 10 is also the default
pd.set_option('display.max_columns', 20)
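
`pd.set_option` changes the display settings for the whole session. If a setting is only needed for one cell, `pd.option_context` can apply it temporarily. The sketch below uses a small made-up frame (`toy`) purely for illustration.

```python
import pandas as pd

# Toy frame with 30 rows, used only to demonstrate display options
toy = pd.DataFrame({"value": range(30)})

# Remember the current setting so it can be compared afterwards
before = pd.get_option("display.max_rows")

# Inside the block, frames longer than 5 rows are printed in truncated form
with pd.option_context("display.max_rows", 5, "display.min_rows", 4):
    inside = pd.get_option("display.max_rows")
    print(toy)

# The previous value is restored automatically when the block ends
after = pd.get_option("display.max_rows")
print(before, inside, after)
```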

## 2. Loading Different Data Formats Into a Pandas Data Frame




### Reading CSV file


In [None]:
# read csv file

url='https://raw.githubusercontent.com/tdmhub/L4DA_resources/main/world_population_uncleansed.csv'
df = pd.read_csv(url)
df.head(3)

Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
0,140,2415276,3759170,4660067,5579144,5742315,7104274,11533423,,622984,622980.0,CF,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GE,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AF,AFG,0.0357,0.027,0.0053,64.7622,167.7341,36
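
`pd.read_csv` accepts many optional parameters. As a rough sketch, the example below reads a short CSV string instead of a file so that it runs without a network connection. The `unknown` marker and the choice of columns are assumptions made for illustration and are not taken from the dataset above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a downloaded file
csv_text = """place,country,pop2023,area
4,Afghanistan,42239854,652230
8,Albania,2832439,28748
12,Algeria,45606480,unknown
"""

toy = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["country", "pop2023", "area"],  # load only these columns
    na_values=["unknown"],                    # treat this marker as missing
)

print(toy)
print(toy.dtypes)
```

Declaring the missing-value marker at load time lets pandas parse `area` as a numeric column rather than as text.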



### Read Excel file

In [None]:
# read excel file
url='https://github.com/tdmhub/L4DA_resources/raw/main/medium-and-heavy-duty-vehicles-2023-07-17.xlsx'
df_excel = pd.read_excel(url, sheet_name='Sheet1')
df_excel

Unnamed: 0,Vehicle ID,Model,Manufacturer,Transmission Make,Num Passengers,Power System Ids,Fuels,Application Categories,Transmission Types
0,11099,ACMD-Xpert,Autocar,Allison,,[10013],CNG - Compressed Natural Gas|LNG - Liquified N...,Refuse,
1,10941,ACMD-Xpert,Autocar,Allison,,[10013],CNG - Compressed Natural Gas|LNG - Liquified N...,Street Sweeper,
2,10773,ACMD-Xpert,Autocar,Allison,,[10013],CNG - Compressed Natural Gas|LNG - Liquified N...,Vocational/Cab Chassis,
3,11209,ACTT Terminal Tractor,Autocar,Vorza,,[10562],Electric,Tractor,Automatic
4,11208,ACTT Terminal Tractor,Autocar,Allison,,[10505],CNG - Compressed Natural Gas|LNG - Liquified N...,Tractor,
...,...,...,...,...,...,...,...,...,...
280,11237,VNR Electric - Class 7,Volvo,,,[],Electric,Vocational/Cab Chassis,
281,10848,W4 CC,Workhorse,,,[10589],Electric,Vocational/Cab Chassis,Automatic
282,11243,HDXT - Class 8,Xos,Allison e-Axle,,"[10584, 10585]",Electric,Tractor,Automatic
283,11244,MDXT,Xos,Allison e-Axle,,"[10584, 10585]",Electric,Vocational/Cab Chassis,Automatic


## 3. Data preprocessing
Data preprocessing is the process of turning raw data into clean data, and it is one of the most crucial parts of data science. In this section, we first explore the data, then remove unwanted columns, remove duplicates, handle missing data, and so on. After this step, we have clean data ready for analysis.
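
The steps listed above can be chained into a single pipeline. The following is only a minimal sketch on a made-up frame (`raw`, with hypothetical column names), not the exact workflow of this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one exact duplicate row and one missing value
raw = pd.DataFrame({
    "country": ["A", "B", "B", "C"],
    "population": [100.0, 200.0, 200.0, np.nan],
    "notes": ["x", "y", "y", "z"],
})

clean = (
    raw.drop(columns=["notes"])          # remove an unwanted column
       .drop_duplicates()                # remove exact duplicate rows
       .dropna(subset=["population"])    # drop rows missing a population
       .reset_index(drop=True)           # renumber the remaining rows
)
print(clean)
```

Each method returns a new DataFrame, so `raw` stays unchanged and every step can be inspected separately.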

### 3.1 Data Exploring

#### Retrieving rows from data frame.

In [None]:
# display first 3 rows
df.head(3)


Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
0,140,2415276,3759170,4660067,5579144,5742315,7104274,11533423,,622984,622980.0,CF,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GE,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AF,AFG,0.0357,0.027,0.0053,64.7622,167.7341,36


In [None]:
# display last 6 rows
df.tail(6)

Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
201,862,15210443,24427729,28715022,28301696,28838499,32027461,35937404,Venezuela,916445,882050.0,VE,VEN,0.0179,0.019,0.0036,32.6949,84.6797,52
202,704,52968270,79001142,87411012,98186856,98858950,102699905,107012939,Vietnam,331212,313429.0,VN,VNM,0.0208,0.0068,0.0123,315.411,816.9145,16
203,732,116775,270375,413296,575986,587259,662726,851067,Western Sahara,266000,266000.0,EH,ESH,0.0004,0.0196,0.0001,2.2077,5.718,172
204,887,9204938,18628700,24743946,33696614,34449825,39923245,55296331,Yemen,527968,527970.0,YE,YEM,0.024,0.0224,0.0043,65.2496,168.9964,44
205,894,5720438,9891136,13792086,20017675,20569737,24676417,37460435,Zambia,752612,743390.0,ZM,ZMB,0.0178,0.0276,0.0026,27.6702,71.6658,63
206,716,7049926,11834676,12839771,16320537,16665409,19179393,26438589,Zimbabwe,390757,386850.0,ZW,ZWE,0.0112,0.0211,0.0021,43.0798,111.5766,74


#### Retrieving sample rows from data frame.



In [None]:
# Display random 7 sample rows
df.sample(7)

Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
130,562,6173177,11622665,16647543,26207977,27202843,35217942,67043296,Niger,1267000,1266700.0,NE,NER,0.0321,0.038,0.0034,21.4754,55.6212,54
12,31,6383060,8190337,9237202,10358074,10412651,10711138,10867296,Azerbaijan,86600,82646.0,AZ,AZE,0.0017,0.0053,0.0013,125.991,326.3167,90
49,218,8135845,12626507,14989585,18001000,18190484,19486952,22269779,Ecuador,276841,248360.0,EC,ECU,0.006,0.0105,0.0023,73.2424,189.6978,68
54,233,1476983,1396877,1331535,1326062,1322765,1289441,1171695,Estonia,45227,42750.0,EE,EST,-0.0001,-0.0025,0.0002,30.9419,80.1394,156
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GE,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132
6,24,8330047,16394062,23364185,35588987,36684202,44911664,72328068,Angola,1246700,1246700.0,AO,AGO,0.0352,0.0308,0.0046,29.425,76.2109,42
19,84,145133,240406,322106,405272,410825,450428,538838,Belize,22966,22810.0,BZ,BLZ,0.0002,0.0137,0.0001,18.0107,46.6478,177
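
`sample()` returns different rows on every run. When a reproducible sample is needed, a seed can be passed through `random_state`. The sketch below uses a made-up frame, and the seed values are arbitrary.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100)})

# The same seed yields the same rows on every call
first = toy.sample(n=7, random_state=42)
second = toy.sample(n=7, random_state=42)
print(first.equals(second))

# frac draws a fraction of the rows instead of a fixed count
tenth = toy.sample(frac=0.1, random_state=0)
print(len(tenth))
```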


#### Retrieving information about dataframe

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207 entries, 0 to 206
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   place            207 non-null    int64  
 1   pop1980          207 non-null    int64  
 2   pop2000          207 non-null    int64  
 3   pop2010          207 non-null    int64  
 4   pop2022          207 non-null    object 
 5   pop2023          207 non-null    int64  
 6   pop2030          207 non-null    int64  
 7   pop2050          207 non-null    object 
 8   country          207 non-null    object 
 9   area             207 non-null    object 
 10  landAreaKm       207 non-null    float64
 11  cca2             206 non-null    object 
 12  cca3             207 non-null    object 
 13  netChange        204 non-null    float64
 14  growthRate       207 non-null    float64
 15  worldPercentage  206 non-null    float64
 16  density          207 non-null    float64
 17  densityMi       

In [None]:
# display datatypes
df.dtypes

place                int64
pop1980              int64
pop2000              int64
pop2010              int64
pop2022             object
pop2023              int64
pop2030              int64
pop2050             object
country             object
area                object
landAreaKm         float64
cca2                object
cca3                object
netChange          float64
growthRate         float64
worldPercentage    float64
density            float64
densityMi          float64
rank                 int64
dtype: object

In [None]:
df.dtypes.value_counts()

int64      7
object     6
float64    6
dtype: int64

#### Display number of rows and columns.

In [None]:
df.shape

(207, 19)

In [None]:
df.columns

Index(['place', 'pop1980', 'pop2000', 'pop2010', 'pop2022', 'pop2023',
       'pop2030', 'pop2050', 'country', 'area', 'landAreaKm', 'cca2', 'cca3',
       'netChange', 'growthRate', 'worldPercentage', 'density', 'densityMi',
       'rank'],
      dtype='object')

In [None]:
# display place columns first 3 rows data
df['place'].head(3)

0    140
1    268
2      4
Name: place, dtype: int64

In [None]:
# display first 4 rows of place, pop2023 and area columns
df[['place', 'pop2023', 'area']].head(4)

Unnamed: 0,place,pop2023,area
0,140,5742315,622984
1,268,3728282,69700
2,4,42239854,652230
3,8,2832439,28748


#### Retrieving a Range of Rows

In [None]:
# display rows at positions 2 to 6 (the slice end, 7, is excluded)
df[2:7]






Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AF,AFG,0.0357,0.027,0.0053,64.7622,167.7341,36
3,8,2941651,3182021,2913399,2842321,2832439,2789599,2456472,Albania,28748,27400.0,AL,ALB,-0.0003,-0.0035,0.0004,103.3737,267.7378,138
4,12,18739378,30774621,35856344,44903225,45606480,49787283,60001113,Algeria,2381741,2381741.0,DZ,DZA,0.0218,0.0157,0.0057,19.1484,49.5943,34
5,20,35611,66097,71519,79824,80088,81528,80504,Andorra,468,470.0,AD,AND,0.0,0.0033,0.0,170.4,441.336,203
6,24,8330047,16394062,23364185,35588987,36684202,44911664,72328068,Angola,1246700,1246700.0,AO,AGO,0.0352,0.0308,0.0046,29.425,76.2109,42


In [None]:
# display rows 0 to 10
df[:11]

Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
0,140,2415276,3759170,4660067,5579144,5742315,7104274,11533423,,622984,622980.0,CF,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GE,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AF,AFG,0.0357,0.027,0.0053,64.7622,167.7341,36
3,8,2941651,3182021,2913399,2842321,2832439,2789599,2456472,Albania,28748,27400.0,AL,ALB,-0.0003,-0.0035,0.0004,103.3737,267.7378,138
4,12,18739378,30774621,35856344,44903225,45606480,49787283,60001113,Algeria,2381741,2381741.0,DZ,DZA,0.0218,0.0157,0.0057,19.1484,49.5943,34
5,20,35611,66097,71519,79824,80088,81528,80504,Andorra,468,470.0,AD,AND,0.0,0.0033,0.0,170.4,441.336,203
6,24,8330047,16394062,23364185,35588987,36684202,44911664,72328068,Angola,1246700,1246700.0,AO,AGO,0.0352,0.0308,0.0046,29.425,76.2109,42
7,28,64889,75055,85695,93763,94298,97510,99030,Antigua and Barbuda,442,440.0,AG,ATG,0.0,0.0057,0.0,214.3136,555.0723,201
8,32,28024803,37070774,41100123,45510318,45773884,47678560,51621175,Argentina,2780400,2736690.0,AR,ARG,0.0091,0.0058,0.0057,16.726,43.3203,33
9,51,3135123,3168523,2946293,2780469,2777970,2759528,2573465,Armenia,29743,28470.0,AM,ARM,0.0,-0.0009,0.0003,97.5753,252.7201,140


In [None]:
# display the last two rows
df[-2:]

Unnamed: 0,place,pop1980,pop2000,pop2010,pop2022,pop2023,pop2030,pop2050,country,area,landAreaKm,cca2,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
205,894,5720438,9891136,13792086,20017675,20569737,24676417,37460435,Zambia,752612,743390.0,ZM,ZMB,0.0178,0.0276,0.0026,27.6702,71.6658,63
206,716,7049926,11834676,12839771,16320537,16665409,19179393,26438589,Zimbabwe,390757,386850.0,ZW,ZWE,0.0112,0.0211,0.0021,43.0798,111.5766,74
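
Plain slices such as `df[2:7]` select rows by position. The `iloc` and `loc` accessors make the intent explicit: `iloc` works on positions and excludes the end of the slice, while `loc` works on index labels and includes it. A small sketch on a made-up frame with string labels:

```python
import pandas as pd

# Toy frame with string labels so that positions and labels differ
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=list("abcde"))

by_position = toy.iloc[1:3]    # positions 1 and 2; the end is exclusive
by_label = toy.loc["b":"d"]    # labels b, c and d; the end is inclusive
last_two = toy.iloc[-2:]       # the last two rows

print(by_position.index.tolist())
print(by_label.index.tolist())
print(last_two.index.tolist())
```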


### 3.2 Data Cleaning
After exploring our dataset, we may need to clean it for better analysis. Data often comes from multiple sources, so some values may contain errors. This is where data cleaning becomes extremely important. In this section, we will delete unwanted columns, rename columns, correct data types, etc.


#### Delete columns

In [None]:
# Drop unwanted columns
df.drop(['cca2'], axis=1, inplace=True)

#### Rename columns

In [None]:
# create a new dataframe, df_col, as a copy of df
df_col = df.copy()

# rename columns
df_col.rename(columns={"place": "country_name", "pop2023": "population"}, inplace=True)
df_col.head(3)

Unnamed: 0,country_name,pop1980,pop2000,pop2010,pop2022,population,pop2030,pop2050,country,area,landAreaKm,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
0,140,2415276,3759170,4660067,5579144,5742315,7104274,11533423,,622984,622980.0,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AFG,0.0357,0.027,0.0053,64.7622,167.7341,36


#### Adding a new column to a DataFrame



In [None]:
# Add a new column whose value is population / 1000
df_col['population_thousand'] = df_col['population'] / 1000
df_col

Unnamed: 0,country_name,pop1980,pop2000,pop2010,pop2022,population,pop2030,pop2050,country,area,landAreaKm,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank,population_thousand
0,140,2415276,3759170,4660067,5579144,5742315,7104274,11533423,,622984,622980.0,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117,5742.315
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132,3728.282
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AFG,0.0357,0.0270,0.0053,64.7622,167.7341,36,42239.854
3,8,2941651,3182021,2913399,2842321,2832439,2789599,2456472,Albania,28748,27400.0,ALB,-0.0003,-0.0035,0.0004,103.3737,267.7378,138,2832.439
4,12,18739378,30774621,35856344,44903225,45606480,49787283,60001113,Algeria,2381741,2381741.0,DZA,0.0218,0.0157,0.0057,19.1484,49.5943,34,45606.480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,704,52968270,79001142,87411012,98186856,98858950,102699905,107012939,Vietnam,331212,313429.0,VNM,0.0208,0.0068,0.0123,315.4110,816.9145,16,98858.950
203,732,116775,270375,413296,575986,587259,662726,851067,Western Sahara,266000,266000.0,ESH,0.0004,0.0196,0.0001,2.2077,5.7180,172,587.259
204,887,9204938,18628700,24743946,33696614,34449825,39923245,55296331,Yemen,527968,527970.0,YEM,0.0240,0.0224,0.0043,65.2496,168.9964,44,34449.825
205,894,5720438,9891136,13792086,20017675,20569737,24676417,37460435,Zambia,752612,743390.0,ZMB,0.0178,0.0276,0.0026,27.6702,71.6658,63,20569.737
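
An alternative to assigning a column in place is `DataFrame.assign`, which returns a new frame and can add several derived columns at once. The sketch below uses a two-row frame built from values shown earlier. The `density` column here is recomputed only for illustration and is not the dataset's own `density` column.

```python
import pandas as pd

toy = pd.DataFrame({"population": [5742315, 3728282], "area": [622984, 69700]})

# assign returns a new frame and leaves toy unchanged
with_extras = toy.assign(
    population_thousand=lambda d: d["population"] / 1000,
    density=lambda d: d["population"] / d["area"],
)
print(with_extras)
```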


In [None]:
df_col.head(3)

Unnamed: 0,country_name,pop1980,pop2000,pop2010,pop2022,population,pop2030,pop2050,country,area,landAreaKm,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank,population_thousand
0,140,2415276,3759170,4660067,5579144,5742315,7104274,11533423,,622984,622980.0,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117,5742.315
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132,3728.282
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AFG,0.0357,0.027,0.0053,64.7622,167.7341,36,42239.854
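
The `df.info()` output earlier shows that some numeric-looking columns, such as `pop2022`, are stored as `object`. One common way to correct a data type is `pd.to_numeric`. The sketch below assumes the column contains a non-numeric entry (`"n/a"` is a made-up example), since the actual offending values in the dataset are not shown here.

```python
import pandas as pd

# Hypothetical column read as text because one entry is not a number
toy = pd.DataFrame({"pop2022": ["5579144", "3744385", "n/a"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
toy["pop2022"] = pd.to_numeric(toy["pop2022"], errors="coerce")

print(toy["pop2022"].dtype)
print(toy["pop2022"].isna().sum())
```

The coerced entries become missing values, so they can then be handled together with the rest of the missing data.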


### 3.3 Remove duplicates

In [None]:
# count duplicated rows
df_col.duplicated().sum()

2

In [None]:
# flag duplicate rows; keep accepts 'first', 'last' or False
duplicate_value = df_col.duplicated(keep='first')

# show the duplicates
df_col[duplicate_value]

Unnamed: 0,country_name,pop1980,pop2000,pop2010,pop2022,population,pop2030,pop2050,country,area,landAreaKm,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank,population_thousand
113,480,954865,1215930,1283330,1299469,1300557,1305425,1226235,Mauritius,2040,2030.0,MUS,0.0,0.0008,0.0002,640.6685,1659.3313,157,1300.557
168,90,233668,429978,540394,724273,740424,856264,1225407,Solomon Islands,28896,27990.0,SLB,0.0005,0.0223,0.0001,26.4532,68.5137,166,740.424


In [None]:
# drop duplicate rows, keeping the first occurrence of each
df_col.drop_duplicates(keep = 'first', inplace = True)
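
The `keep` and `subset` parameters change what counts as a duplicate and which copy survives. A minimal sketch on made-up data (the names `A` and `B` are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "rank": [157, 157, 44, 45],
})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = toy[toy.duplicated(keep=False)]

# subset limits the comparison to the listed columns
one_per_country = toy.drop_duplicates(subset=["country"], keep="last")

print(all_copies)
print(one_per_country)
```

Here the two `B` rows differ in `rank`, so they only count as duplicates when the comparison is restricted to `country`.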

### 3.4 Handling missing values

Handling missing values is a common task in data preprocessing. Real datasets contain missing values for many reasons, and without dealing with them we cannot build reliable models. In this section, we first find the missing values and then decide how to handle them: by removing the affected rows or columns, or by replacing the missing entries with appropriate values.

#### Display missing values information

In [None]:
df_col.isna().sum().sort_values(ascending=False)

netChange              3
worldPercentage        1
country_name           0
landAreaKm             0
rank                   0
densityMi              0
density                0
growthRate             0
cca3                   0
area                   0
pop1980                0
country                0
pop2050                0
pop2030                0
population             0
pop2022                0
pop2010                0
pop2000                0
population_thousand    0
dtype: int64
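
Absolute counts are easier to judge next to the share of missing entries per column. Because `isna()` returns booleans, taking the mean gives that share directly. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "netChange": [0.01, np.nan, np.nan, 0.02],
    "growthRate": [0.03, 0.01, np.nan, 0.02],
    "rank": [1, 2, 3, 4],
})

# isna() gives booleans; their mean is the fraction of missing entries
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```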

#### Delete NaN rows

If only a few values are NaN, we can delete the affected rows with the `dropna()` function. The `subset` parameter takes the names of the columns to check for NaN.

In [None]:
# copy df_col to df_new
df_new = df_col.copy()

In [None]:
# Delete rows with NaN in the netChange column
df_new.dropna(subset = ["netChange"], inplace=True)
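
Besides `subset`, `dropna()` also accepts `how` and `thresh`, which control how many missing values a row needs before it is dropped. The following sketch compares the three options on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "netChange": [0.01, np.nan, np.nan],
    "growthRate": [0.03, 0.01, np.nan],
    "density": [9.2, 53.7, np.nan],
})

only_net = toy.dropna(subset=["netChange"])   # rows missing netChange are dropped
all_missing = toy.dropna(how="all")           # only fully empty rows are dropped
at_least_two = toy.dropna(thresh=2)           # keep rows with 2+ non-missing values

print(len(only_net), len(all_missing), len(at_least_two))
```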

In [None]:
df_new

Unnamed: 0,country_name,pop1980,pop2000,pop2010,pop2022,population,pop2030,pop2050,country,area,landAreaKm,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
0,140,2415276,3759170,4660067,5579144,5742315,7104274,11533423,,622984,622980.0,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117
1,268,5145843,4265172,3836831,3744385,3728282,3657494,3384660,,69700,69490.0,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132
2,4,12486631,19542982,28189672,41128771,42239854,50330837,74075234,Afghanistan,652230,652230.0,AFG,0.0357,0.0270,0.0053,64.7622,167.7341,36
3,8,2941651,3182021,2913399,2842321,2832439,2789599,2456472,Albania,28748,27400.0,ALB,-0.0003,-0.0035,0.0004,103.3737,267.7378,138
4,12,18739378,30774621,35856344,44903225,45606480,49787283,60001113,Algeria,2381741,2381741.0,DZA,0.0218,0.0157,0.0057,19.1484,49.5943,34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,704,52968270,79001142,87411012,98186856,98858950,102699905,107012939,Vietnam,331212,313429.0,VNM,0.0208,0.0068,0.0123,315.4110,816.9145,16
203,732,116775,270375,413296,575986,587259,662726,851067,Western Sahara,266000,266000.0,ESH,0.0004,0.0196,0.0001,2.2077,5.7180,172
204,887,9204938,18628700,24743946,33696614,34449825,39923245,55296331,Yemen,527968,527970.0,YEM,0.0240,0.0224,0.0043,65.2496,168.9964,44
205,894,5720438,9891136,13792086,20017675,20569737,24676417,37460435,Zambia,752612,743390.0,ZMB,0.0178,0.0276,0.0026,27.6702,71.6658,63


In [None]:
# Check that there are no NaN values left in netChange
df_new.isna().sum().sort_values(ascending=False)

country_name       0
pop1980            0
densityMi          0
density            0
worldPercentage    0
growthRate         0
netChange          0
cca3               0
landAreaKm         0
area               0
country            0
pop2050            0
pop2030            0
population         0
pop2022            0
pop2010            0
pop2000            0
rank               0
dtype: int64

#### Delete entire columns

If a column contains a large number of NaN values, dropping that column might be a better decision than imputing it.

In [None]:
df_new.drop(columns=['pop2050'], inplace=True)
df_new

Unnamed: 0,country_name,pop1980,pop2000,pop2010,pop2022,population,pop2030,country,area,landAreaKm,cca3,netChange,growthRate,worldPercentage,density,densityMi,rank
0,140,2415276,3759170,4660067,5579144,5742315,7104274,,622984,622980.0,CAF,0.0053,0.0292,0.0007,9.2175,23.8733,117
1,268,5145843,4265172,3836831,3744385,3728282,3657494,,69700,69490.0,GEO,-0.0005,-0.0043,0.0005,53.6521,138.9588,132
2,4,12486631,19542982,28189672,41128771,42239854,50330837,Afghanistan,652230,652230.0,AFG,0.0357,0.0270,0.0053,64.7622,167.7341,36
3,8,2941651,3182021,2913399,2842321,2832439,2789599,Albania,28748,27400.0,ALB,-0.0003,-0.0035,0.0004,103.3737,267.7378,138
4,12,18739378,30774621,35856344,44903225,45606480,49787283,Algeria,2381741,2381741.0,DZA,0.0218,0.0157,0.0057,19.1484,49.5943,34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202,704,52968270,79001142,87411012,98186856,98858950,102699905,Vietnam,331212,313429.0,VNM,0.0208,0.0068,0.0123,315.4110,816.9145,16
203,732,116775,270375,413296,575986,587259,662726,Western Sahara,266000,266000.0,ESH,0.0004,0.0196,0.0001,2.2077,5.7180,172
204,887,9204938,18628700,24743946,33696614,34449825,39923245,Yemen,527968,527970.0,YEM,0.0240,0.0224,0.0043,65.2496,168.9964,44
205,894,5720438,9891136,13792086,20017675,20569737,24676417,Zambia,752612,743390.0,ZMB,0.0178,0.0276,0.0026,27.6702,71.6658,63
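
Instead of naming columns by hand, the columns to drop can be chosen by their share of missing values. The 50% cut-off below is an arbitrary assumption for illustration, and the frame is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1, 2, 3, 4],
})

threshold = 0.5  # hypothetical cut-off: drop columns more than 50% missing
missing_share = toy.isna().mean()
to_drop = missing_share[missing_share > threshold].index
reduced = toy.drop(columns=to_drop)

print(reduced.columns.tolist())
```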


**Method 2** - Impute Mean, Median and Mode

In [None]:
# Mean of the population column (a candidate fill value for imputation)
mean_population = df_new['population'].mean()
print(f"mean_population: {mean_population}")

# Median of the population column (a more robust candidate fill value)
median_population = df_new['population'].median()
print(f"median_population: {median_population}")



mean_population: 39426116.6617647
median_population: 8212436.0
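
The mean, median or mode only becomes an imputation once it is written into the missing cells, typically with `fillna()`. The sketch below shows this on a made-up frame. The `region` column is hypothetical and does not exist in the dataset above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "netChange": [0.01, np.nan, 0.03],
    "region": ["Asia", None, "Asia"],
})

# Numeric column: fill with the median (the mean works the same way)
toy["netChange"] = toy["netChange"].fillna(toy["netChange"].median())

# Text column: fill with the mode; mode() returns a Series, so take the first
toy["region"] = toy["region"].fillna(toy["region"].mode()[0])

print(toy)
```

The median is often preferred over the mean for skewed columns such as population, because a few very large values pull the mean upwards.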


### 5.1. Calculating basic statistical measures

In [None]:
df_new.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
country_name,204.0,427.598,251.8131,4.0,211.0,424.0,642.25,894.0
pop1980,204.0,26669120.0,110924000.0,0.0,770901.5,4256327.0,13232790.0,1000000000.0
pop2000,204.0,30130150.0,119178400.0,9638.0,1215930.0,5469575.0,18836770.0,1264099000.0
pop2010,204.0,34231280.0,132513300.0,10241.0,1319484.0,7031848.0,23076060.0,1348191000.0
population,204.0,39426120.0,146507400.0,11396.0,1522580.0,8212436.0,28847130.0,1428628000.0
pop2030,204.0,41570720.0,151174800.0,-200032.0,1534538.0,8004986.0,33362820.0,1514994000.0
landAreaKm,204.0,638984.3,1795977.0,2.0,21410.0,110225.0,479421.7,16376870.0
netChange,204.0,0.01141765,0.03643509,-0.0286,0.0,0.00135,0.010525,0.4184
growthRate,204.0,0.01073873,0.0128021,-0.0745,0.0031,0.00895,0.019375,0.0498
worldPercentage,204.0,0.004926961,0.01830336,0.0,0.0002,0.00105,0.0036,0.1785


By default, `describe()` summarises only the numeric columns, as the output above shows. For object (text) columns, we can use `describe(include=object)`.

In [None]:
df_new.describe(include=object).T

Unnamed: 0,count,unique,top,freq
pop2022,204,202,724273.0,2
country,204,201,,2
area,204,202,28896.0,2
cca3,204,201,,2
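
To summarise numeric and text columns in one table, `describe(include="all")` can be used, and `value_counts()` shows the full frequency table behind the `top` and `freq` entries. A minimal sketch on a made-up frame (the `region` column is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe"],
    "population": [100, 200, 300],
})

# include="all" summarises numeric and non-numeric columns together;
# statistics that do not apply to a column are shown as NaN
summary = toy.describe(include="all")
print(summary)

# value_counts gives the full frequency table behind "top" and "freq"
counts = toy["region"].value_counts()
print(counts)
```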
