# DataFrames: 2nd Part

In this part I explore more DataFrame functionality:
* Filtering rows
* Adding new columns and overwriting the .csv file
* Removing columns
* Renaming columns
* Indexing and selecting data
* Sorting data
* Handling missing data
* Grouping data
* Simple exercise to find row data based on user input

In [1]:
# Import necessary libraries
import pandas as pd

# Load the dataset from a CSV file into a DataFrame
df = pd.read_csv('data/world_countries.csv')
# Filter rows where the country's GDP is greater than 60 billion
print(f"Countries with GDP greater than 60 Billion: \n{df[df['GDP (BILLIONS)'] > 60]}\n")

Countries with GDP greater than 60 Billion: 
Empty DataFrame
Columns: [COUNTRY, GDP (BILLIONS), CODE]
Index: []
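
The empty result above is consistent with the data: the sorted output further down shows a maximum of about 47.29, so no row passes the 60 threshold. As a minimal sketch of how boolean filtering behaves when rows do match, here is the same pattern on a small made-up DataFrame (the country names, codes, and values are hypothetical, not taken from `world_countries.csv`):

```python
import pandas as pd

# Hypothetical sample data; names, codes, and values are invented for illustration
sample = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B", "Country C", "Country D"],
    "GDP (BILLIONS)": [12.5, 75.0, 48.2, 130.0],
    "CODE": ["AAA", "BBB", "CCC", "DDD"],
})

# Check the value range first so the chosen threshold is meaningful
print(sample["GDP (BILLIONS)"].describe())

# Single condition: the boolean mask keeps only rows where it is True
high = sample[sample["GDP (BILLIONS)"] > 60]

# Combined conditions: wrap each in parentheses and join with & (and) or | (or)
mid = sample[(sample["GDP (BILLIONS)"] > 40) & (sample["GDP (BILLIONS)"] < 100)]

print(high)
print(mid)
```

Calling `describe()` (or `max()`) on the column before filtering is a quick way to see whether a threshold can match anything at all.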



In [2]:
# Add a new column 'GDP_Category' that labels GDP above 100 billion as 'High' (arbitrary threshold)
df['GDP_Category'] = df['GDP (BILLIONS)'].apply(lambda x: 'High' if x > 100 else 'Low')
print(f"DataFrame with new column 'GDP_Category':\n{df.head()}\n")

# Overwrite the .csv file so that it includes the new column
df.to_csv("data/world_countries.csv", index=False)

DataFrame with new column 'GDP_Category':
      COUNTRY  GDP (BILLIONS) CODE GDP_Category
0     Uruguay       16.711181  URY          Low
1    Bulgaria       16.711181  BGR          Low
2       Macau       16.711181  MAC          Low
3  Costa Rica       16.711181  CRI          Low
4    Slovenia       16.711181  SVN          Low
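
For a simple two-way label like this, `numpy.where` is a common vectorized alternative to `apply` with a lambda. The sketch below uses made-up values (not the real dataset) and checks that both approaches give the same labels:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to sit on both sides of the 100 threshold
sample = pd.DataFrame({"GDP (BILLIONS)": [16.7, 250.0, 99.9, 100.0]})

# Vectorized labelling: 'High' where the condition holds, 'Low' elsewhere
sample["GDP_Category"] = np.where(sample["GDP (BILLIONS)"] > 100, "High", "Low")

# The same labels computed row by row with apply, for comparison
via_apply = sample["GDP (BILLIONS)"].apply(lambda x: "High" if x > 100 else "Low")

print(sample)
print((sample["GDP_Category"] == via_apply).all())
```

Note that the comparison is strict, so a value of exactly 100 is labelled 'Low' in both versions.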



In [3]:
# Remove the 'GDP_Category' column
df = df.drop(columns=['GDP_Category'])
print(f"DataFrame after removing 'GDP_Category' column:\n{df.head()}\n")

DataFrame after removing 'GDP_Category' column:
      COUNTRY  GDP (BILLIONS) CODE
0     Uruguay       16.711181  URY
1    Bulgaria       16.711181  BGR
2       Macau       16.711181  MAC
3  Costa Rica       16.711181  CRI
4    Slovenia       16.711181  SVN



In [4]:
# Rename the 'COUNTRY' column to 'NATION'
df = df.rename(columns={'COUNTRY': 'NATION'})
print(f"DataFrame after renaming 'COUNTRY' to 'NATION':\n{df.head()}\n")

DataFrame after renaming 'COUNTRY' to 'NATION':
       NATION  GDP (BILLIONS) CODE
0     Uruguay       16.711181  URY
1    Bulgaria       16.711181  BGR
2       Macau       16.711181  MAC
3  Costa Rica       16.711181  CRI
4    Slovenia       16.711181  SVN



In [5]:
# Set 'NATION' as the index of the DataFrame
df = df.set_index('NATION')
print(f"DataFrame with 'NATION' as index:\n{df.head()}\n")

DataFrame with 'NATION' as index:
            GDP (BILLIONS) CODE
NATION                         
Uruguay          16.711181  URY
Bulgaria         16.711181  BGR
Macau            16.711181  MAC
Costa Rica       16.711181  CRI
Slovenia         16.711181  SVN
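
Once a column is the index, rows can be selected by label with `.loc` or by position with `.iloc`. A small sketch of both, using a hypothetical three-row frame with placeholder GDP values:

```python
import pandas as pd

# Hypothetical frame; the GDP numbers are placeholders, not real data
sample = pd.DataFrame({
    "NATION": ["Uruguay", "Bulgaria", "Macau"],
    "GDP (BILLIONS)": [1.0, 2.0, 3.0],
    "CODE": ["URY", "BGR", "MAC"],
}).set_index("NATION")

# Label-based selection: one row comes back as a Series
row = sample.loc["Bulgaria"]

# Label-based selection of a single cell (row label, column label)
code = sample.loc["Macau", "CODE"]

# Position-based selection: the first two rows, whatever their labels are
first_two = sample.iloc[:2]

# Several rows and columns at once: pass lists of labels
subset = sample.loc[["Uruguay", "Macau"], ["CODE"]]

print(row, code, first_two, subset, sep="\n\n")
```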



In [6]:
# Sort the DataFrame by 'GDP (BILLIONS)' in descending order
df = df.sort_values(by='GDP (BILLIONS)', ascending=False)
print(f"DataFrame sorted by 'GDP (BILLIONS)' in descending order:\n{df.head()}\n")

DataFrame sorted by 'GDP (BILLIONS)' in descending order:
                      GDP (BILLIONS) CODE
NATION                                   
Hong Kong                  47.285587  HKG
Egypt                      47.285587  EGY
Iran                       47.285587  IRN
United Arab Emirates       47.285587  ARE
Norway                     47.285587  NOR



In [7]:
# Handle missing data by filling NaN values with the mean of the column

# For demonstration, let's introduce some NaN values
import numpy as np

# Introduce NaN values in the first 20 rows of 'GDP (BILLIONS)'
df.iloc[:20, df.columns.get_loc('GDP (BILLIONS)')] = np.nan
print(f"DataFrame with introduced NaN values:\n{df.head()}\n")

# Fill missing values with the mean of each numeric column
df = df.fillna(df.mean(numeric_only=True))
print(f"DataFrame after handling missing data:\n{df.head()}\n")

# To fill only one column's NaN values (not the whole DataFrame), assign the result back:
# df['GDP (BILLIONS)'] = df['GDP (BILLIONS)'].fillna(df['GDP (BILLIONS)'].mean())

DataFrame with introduced NaN values:
                      GDP (BILLIONS) CODE
NATION                                   
Hong Kong                        NaN  HKG
Egypt                            NaN  EGY
Iran                             NaN  IRN
United Arab Emirates             NaN  ARE
Norway                           NaN  NOR

DataFrame after handling missing data:
                      GDP (BILLIONS) CODE
NATION                                   
Hong Kong                  14.108893  HKG
Egypt                      14.108893  EGY
Iran                       14.108893  IRN
United Arab Emirates       14.108893  ARE
Norway                     14.108893  NOR
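
Before filling, it can help to count how many values are actually missing in each column. The sketch below does that with `isna().sum()` on a hypothetical frame and then fills only the numeric column, leaving the gap in `CODE` alone:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a numeric and a text column
sample = pd.DataFrame({
    "GDP (BILLIONS)": [np.nan, 10.0, 20.0, np.nan],
    "CODE": ["AAA", "BBB", None, "DDD"],
})

# Count missing values per column before doing anything
missing_before = sample.isna().sum()

# Fill one column with its own mean and assign the result back
col = "GDP (BILLIONS)"
sample[col] = sample[col].fillna(sample[col].mean())

missing_after = sample.isna().sum()

print(missing_before)
print(missing_after)
print(sample)
```

`mean()` skips NaN by default, so the fill value here is the mean of the two known values.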



In [8]:
# Group the DataFrame by 'GDP_Category' and calculate the mean GDP for each group
# First, we need to re-add the 'GDP_Category' column for grouping
df['GDP_Category'] = df['GDP (BILLIONS)'].apply(lambda x: 'High' if x > 100 else 'Low')
grouped_df = df.groupby('GDP_Category').mean(numeric_only=True)
print(f"Grouped DataFrame by 'GDP_Category' level:\n{grouped_df}\n")

Grouped DataFrame by 'GDP_Category' level:
              GDP (BILLIONS)
GDP_Category                
Low                14.108893
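
Only a 'Low' group appears above because no GDP value in the data exceeds 100. To show what grouping looks like with more than one group, here is a sketch on made-up values that also uses `agg` to compute several statistics at once:

```python
import pandas as pd

# Hypothetical values that fall on both sides of the 100 threshold
sample = pd.DataFrame({"GDP (BILLIONS)": [10.0, 30.0, 150.0, 250.0, 50.0]})
sample["GDP_Category"] = sample["GDP (BILLIONS)"].apply(
    lambda x: "High" if x > 100 else "Low"
)

# Select the column to aggregate, then compute several statistics per group
summary = sample.groupby("GDP_Category")["GDP (BILLIONS)"].agg(["count", "mean", "max"])

print(summary)
```

Including `count` in the summary makes it easy to spot groups that are empty or very small.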



In [9]:
# Simple exercise: find and display data for a country based on user input (COUNTRY or CODE)
# First reset the index, since 'NATION' was set as the index earlier
df = df.reset_index()
# Then rename 'NATION' back to 'COUNTRY'
df = df.rename(columns={'NATION': 'COUNTRY'})
user_input = input("Enter a COUNTRY name or CODE to find its data: ").strip()
try:
    # Case-insensitive match against either column
    query = user_input.lower()
    result = df[(df['COUNTRY'].str.lower() == query) | (df['CODE'].str.lower() == query)]
    if not result.empty:
        print(f"Data for {user_input}:\n{result}\n")
    else:
        print(f"No data found for {user_input}.\n")
except Exception as e:
    print(f"An error occurred: {e}\n")

Data for PAK:
     COUNTRY  GDP (BILLIONS) CODE GDP_Category
31  Pakistan       30.224259  PAK          Low
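
The lookup can also be wrapped in a small function so it can be reused and tested without typing input each time. This is only a sketch: `find_country` is a hypothetical helper name, and the sample rows use placeholder GDP values.

```python
import pandas as pd


def find_country(df, query):
    """Return rows whose COUNTRY or CODE equals query, ignoring case and outer spaces."""
    q = query.strip().lower()
    mask = (df["COUNTRY"].str.lower() == q) | (df["CODE"].str.lower() == q)
    return df[mask]


# Hypothetical two-row frame; the GDP numbers are placeholders
sample = pd.DataFrame({
    "COUNTRY": ["Pakistan", "Norway"],
    "GDP (BILLIONS)": [1.0, 2.0],
    "CODE": ["PAK", "NOR"],
})

by_code = find_country(sample, " pak ")   # matches on CODE despite case and spaces
by_name = find_country(sample, "NORWAY")  # matches on COUNTRY
missing = find_country(sample, "XYZ")     # no match gives an empty DataFrame

print(by_code)
print(by_name)
print(missing.empty)
```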



In [10]:
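# Remove the 'GDP_Category' column and write the file back.
# Note: the rows are still sorted and the first 20 GDP values are still mean-filled, so this does not fully restore the original file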
if 'GDP_Category' in df.columns:
    df = df.drop(columns=['GDP_Category'])
df.to_csv("data/world_countries.csv", index=False)
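
Because cells [2] and [10] write to the same file that cell [1] reads, every run changes the source data. One way to avoid that is to write results to a separate file. The sketch below illustrates the idea in a temporary directory; the file name `world_countries_with_category.csv` and the sample values are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the original dataset
original = pd.DataFrame({
    "COUNTRY": ["Uruguay", "Bulgaria"],
    "GDP (BILLIONS)": [1.5, 2.5],   # placeholder values
    "CODE": ["URY", "BGR"],
})

with tempfile.TemporaryDirectory() as tmp:
    source_path = Path(tmp) / "world_countries.csv"
    output_path = Path(tmp) / "world_countries_with_category.csv"  # assumed name
    original.to_csv(source_path, index=False)

    # Read the source, modify a copy in memory, and write to a different file
    df = pd.read_csv(source_path)
    df["GDP_Category"] = "Low"
    df.to_csv(output_path, index=False)

    # Re-read both files: the source is untouched, the output has the new column
    source_after = pd.read_csv(source_path)
    output_after = pd.read_csv(output_path)

print(list(source_after.columns))
print(list(output_after.columns))
```

With this layout no cleanup cell is needed, since the source file is never modified.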