# Lesson 1: Introduction to DataFrames

A **DataFrame** is a two-dimensional, labeled, tabular data structure. Several languages and libraries offer one, and it is best known from **Python's pandas library**. It
resembles a spreadsheet or a SQL table and is one of the most commonly used data structures for data manipulation and analysis.

### Key Features of a DataFrame:
1. **Rows and Columns:**
    - A DataFrame consists of rows (observations) and columns (variables), allowing for structured representation of data.
    - Each column typically represents a feature or attribute, while each row holds a single data record.

2. **Labeled Indexes:**
    - Both rows and columns have labels, which makes it easier to identify, access, and operate on specific data.

3. **Heterogeneous Data Types:**
    - A DataFrame can store heterogeneous data types, meaning that different columns can hold different kinds of data (e.g., integers, strings, floats, etc.).

4. **Powerful Data Manipulation and Analysis Functions:**
    - It is equipped with a wide range of functions to manipulate, filter, reshape, and analyze data efficiently.
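
As a minimal sketch of the labeling and mixed-type features listed above, the snippet below builds a tiny DataFrame and inspects its labels and column types. The `Score` column and the row labels `u1`, `u2` are hypothetical values added only for illustration.

```python
import pandas as pd

# Small illustrative table; "Score" and the labels "u1"/"u2" are made up for this sketch
people = pd.DataFrame(
    {
        "Name": ["user1", "user2"],   # text
        "Age": [20, 21],              # integers
        "Score": [7.5, 9.0],          # floats
    },
    index=["u1", "u2"],               # custom row labels
)

# Each column keeps its own data type
print(people.dtypes)

# Row and column labels are available as attributes
print(list(people.index))
print(list(people.columns))

# Labels let us address a single cell directly: row "u2", column "Age"
print(people.loc["u2", "Age"])
```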

In [1]:
import pandas as pd

# Create a dictionary of user data (each key becomes a column)
data = {"Name":["user1","user2","user3"],
        "Age":[20,21,22],
        "City":["Bern","Cairo","Moscow"]}

# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)  # DataFrame is capitalized because it is a class; calling it runs the class constructor
print(df)

    Name  Age    City
0  user1   20    Bern
1  user2   21   Cairo
2  user3   22  Moscow


In [2]:
# Create a DataFrame with custom row index labels instead of the default 0, 1, 2, ...
df = pd.DataFrame(data, index=[101,102,103])
print(df)

      Name  Age    City
101  user1   20    Bern
102  user2   21   Cairo
103  user3   22  Moscow


In [3]:
# Access the row with index label 102
print(df.loc[102])

Name    user2
Age        21
City    Cairo
Name: 102, dtype: object


In [4]:
# Access a row by integer position (0 is the first row)
print(df.iloc[0])

Name    user1
Age        20
City     Bern
Name: 101, dtype: object
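
The difference between `.loc` (labels) and `.iloc` (positions) matters most when slicing. The sketch below rebuilds the same three-row DataFrame so it can run on its own; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

data = {"Name": ["user1", "user2", "user3"],
        "Age": [20, 21, 22],
        "City": ["Bern", "Cairo", "Moscow"]}
df = pd.DataFrame(data, index=[101, 102, 103])

# .loc slices by label and INCLUDES the end label (rows 101 and 102)
by_label = df.loc[101:102]
print(by_label)

# .iloc slices by position and EXCLUDES the end position (only the first row)
by_position = df.iloc[0:1]
print(by_position)

# Both accessors also accept a (row, column) pair to read a single cell
print(df.loc[103, "City"])   # by labels
print(df.iloc[2, 2])         # by positions: third row, third column
```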


In [5]:
# Add a new column to the DataFrame
df["department"] = ["IT", "HR", "Finance"]

# Create a new DataFrame representing one additional row
# Note: to add several rows at once, put more dictionaries in the list ([{...}, {...}, {...}])
new_row = pd.DataFrame(
    [{"Name": "user4", "Age": 23, "City": "Paris", "department": "UI/UX"}],
    index=[104]
)

# Append the new row to the existing DataFrame
df = pd.concat([df, new_row])
print(df)

      Name  Age    City department
101  user1   20    Bern         IT
102  user2   21   Cairo         HR
103  user3   22  Moscow    Finance
104  user4   23   Paris      UI/UX
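
As a small sketch of the multi-row case mentioned in the comment above, the snippet below appends two rows in one call. It also shows the `ignore_index=True` option of `pd.concat`, which drops the existing labels and renumbers the rows from 0. The shortened two-column table is an assumption made to keep the example compact.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["user1", "user2"], "Age": [20, 21]},
    index=[101, 102],
)

# Two new rows, one dictionary per row
new_rows = pd.DataFrame(
    [{"Name": "user3", "Age": 22}, {"Name": "user4", "Age": 23}],
    index=[103, 104],
)

# Keep the custom labels of both frames
combined = pd.concat([df, new_rows])
print(combined)

# Discard the labels and renumber the rows 0, 1, 2, 3
renumbered = pd.concat([df, new_rows], ignore_index=True)
print(renumbered)
```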


#### Working with CSV Files in pandas

In [6]:
# 1- Read the CSV file into a DataFrame
df = pd.read_csv("world_countries.csv")

# 2- Preview the dataset
# Display the first 5 rows to understand the structure of the data
# Note: print(df) on a large DataFrame shows a truncated view (only the first and last 5 rows)
# To print every row, convert the DataFrame to a string first: print(df.to_string())
print("First 5 rows of the dataset:")
print(df.head())

# Check basic information: column names, data types, and missing values
print("\n Dataset Basic information:")
print(df.info())  # info() prints the report itself and returns None, which is why "None" follows the report

First 5 rows of the dataset:
          COUNTRY  GDP (BILLIONS) CODE
0     Afghanistan           21.71  AFG
1         Albania           13.40  ALB
2         Algeria          227.80  DZA
3  American Samoa            0.75  ASM
4         Andorra            4.80  AND

 Dataset Basic information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222 entries, 0 to 221
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   COUNTRY         222 non-null    object 
 1   GDP (BILLIONS)  222 non-null    float64
 2   CODE            222 non-null    object 
dtypes: float64(1), object(2)
memory usage: 5.3+ KB
None
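
To try the preview functions without the `world_countries.csv` file, the sketch below reads the same five rows shown above from an in-memory string. Using `io.StringIO` as a stand-in for the file, and the name `sample`, are assumptions made for this example only.

```python
import io
import pandas as pd

# The first five rows of the dataset, copied from the output above
csv_text = """COUNTRY,GDP (BILLIONS),CODE
Afghanistan,21.71,AFG
Albania,13.40,ALB
Algeria,227.80,DZA
American Samoa,0.75,ASM
Andorra,4.80,AND
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))       # first 2 rows instead of the default 5
print(sample.tail(2))       # last 2 rows
print(sample.to_string())   # every row, never truncated

# info() prints its report directly, so no print() is needed
sample.info()
```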


In [7]:
# We can display specific data by row or by column
print("Rows 0 to 1:")
print(df.loc[0:1])  # .loc slices by label, so the end label 1 is included

print("\nColumn 'COUNTRY':")
print(df["COUNTRY"]) # A long column is truncated to its first and last 5 rows

Rows 0 to 1:
       COUNTRY  GDP (BILLIONS) CODE
0  Afghanistan           21.71  AFG
1      Albania           13.40  ALB

Column 'COUNTRY':
0         Afghanistan
1             Albania
2             Algeria
3      American Samoa
4             Andorra
            ...      
217    Virgin Islands
218         West Bank
219             Yemen
220            Zambia
221          Zimbabwe
Name: COUNTRY, Length: 222, dtype: object
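
Beyond a single column or a row range, the same indexing syntax can pick several columns or filter rows by a condition. The following is a minimal sketch on three rows taken from the output above; the threshold of 100 and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "COUNTRY": ["Afghanistan", "Albania", "Algeria"],
    "GDP (BILLIONS)": [21.71, 13.40, 227.80],
    "CODE": ["AFG", "ALB", "DZA"],
})

# A single column name returns a Series; a list of names returns a DataFrame
one_col = sample["COUNTRY"]
two_cols = sample[["COUNTRY", "CODE"]]
print(type(one_col).__name__, type(two_cols).__name__)

# .loc can select rows and columns in one step
subset = sample.loc[0:1, ["COUNTRY", "GDP (BILLIONS)"]]
print(subset)

# A boolean condition keeps only the rows where it is True
large = sample[sample["GDP (BILLIONS)"] > 100]
print(large)
```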


In [8]:
# Passing index_col="COUNTRY" to read_csv makes the 'COUNTRY' column the row index, so rows can be looked up by country name
import pandas as pd
df = pd.read_csv("world_countries.csv", index_col="COUNTRY")
country_name = "Egypt"
print(f"Data for {country_name}")
print(df.loc[country_name])

Data for Egypt
GDP (BILLIONS)    284.9
CODE                EGY
Name: Egypt, dtype: object
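
A lookup with `.loc` raises a `KeyError` when the label is not in the index, so it can help to check membership first. The sketch below uses an in-memory CSV with three rows from the outputs above; the country name `"Atlantis"` is a deliberately missing, made-up label.

```python
import io
import pandas as pd

csv_text = """COUNTRY,GDP (BILLIONS),CODE
Albania,13.40,ALB
Algeria,227.80,DZA
Egypt,284.9,EGY
"""
sample = pd.read_csv(io.StringIO(csv_text), index_col="COUNTRY")

# Row label plus column label returns a single value
gdp = sample.loc["Egypt", "GDP (BILLIONS)"]
print(gdp)

# Check the index before looking up a label that may not exist
name = "Atlantis"
if name in sample.index:
    print(sample.loc[name])
else:
    print(f"No data for {name}")

# reset_index() turns the index back into an ordinary column
flat = sample.reset_index()
print(flat.columns.tolist())
```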


#### Basic exploration

In [9]:
# Display summary statistics for numerical columns
print("\nSummary statistics:")
print(df.describe())

# Check how many rows and columns the dataset has (COUNTRY is now the index, so it is not counted as a column)
print("\nShape of the DataFrame (rows, columns):")
print(df.shape)


Summary statistics:
       GDP (BILLIONS)
count      222.000000
mean       352.637162
std       1464.855533
min          0.010000
25%          4.615000
50%         21.525000
75%        196.200000
max      17420.000000

Shape of the DataFrame (rows, columns):
(222, 2)
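
The figures reported by `describe()` can also be computed one at a time, which is handy when only a single value is needed. The sketch below uses the five sample rows shown earlier, indexed by country; the checks for missing values with `isna()` are an extra step not shown in the cells above.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "GDP (BILLIONS)": [21.71, 13.40, 227.80, 0.75, 4.80],
        "CODE": ["AFG", "ALB", "DZA", "ASM", "AND"],
    },
    index=pd.Index(
        ["Afghanistan", "Albania", "Algeria", "American Samoa", "Andorra"],
        name="COUNTRY",
    ),
)

# shape is a (rows, columns) tuple and can be unpacked
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Individual statistics for one numeric column
gdp = sample["GDP (BILLIONS)"]
print(gdp.min(), gdp.max())
print(gdp.median())   # same value as the "50%" row of describe()

# describe() covers only numeric columns by default
stats = sample.describe()
print(stats)

# Count missing values per column
missing = sample.isna().sum()
print(missing)
```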
