# Course Title: Data Analytics & Statistics in Python
## Metropolia University of Applied Sciences
## Week 1: Python ReCap
### Date: 5.3.2025

<table style="width: 100%;">
  <tr>
    <td style="text-align: left; vertical-align: middle;">
      <ul style="list-style: none; padding-left: 0;">
        <li><strong>Instructor</strong>: Hamed Ahmadinia, Ph.D</li>
        <li><strong>Email</strong>: Hamed.Ahmadinia@metropolia.fi</li>
        <li><strong>Web</strong>: www.ahmadinia.fi</li>
      </ul>
    </td>
  </tr>
</table>

# 1. Loading Dataset

## Loading the Dataset  
We will load the **Adult Income Dataset** from the provided GitHub URL. 
This dataset contains demographic data used for income classification tasks.
Let's load the data and take a quick look at the first few rows to understand its structure.  


In [188]:
# 1. Import necessary libraries
import pandas as pd  # pandas is used for handling tabular datasets (dataframes) and performing operations such as reading CSV files
import numpy as np  # numpy is used for numerical computations such as working with arrays and applying mathematical operations

# 2. Load dataset from GitHub URL
file_path = "https://raw.githubusercontent.com/Hamed-Ahmadinia/DASP-2025/main/adult.data.csv"  # URL link to the dataset stored on GitHub

# 3. Read the dataset into a pandas dataframe
df = pd.read_csv(file_path, header=0)  # header=0 means the first row in the CSV is used as column names

# 4. Display the first few rows of the dataframe to confirm the data has been loaded correctly
print("Dataset Preview:")  # Print a label for context
print(df.head(5))  # Display the first 5 rows of the dataset

Dataset Preview:
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country salary  
0          2174             0              40  United-States  <=50K  
1             0          

# 1. Introduction

We will use the "Adult Income Dataset" for all code examples.  
This dataset contains demographic information used for classifying whether a person earns more or less than $50K/year.  
The main columns include:
- `age`: Age of the individual  
- `workclass`: Type of employer (e.g., "Private", "Self-employed")  
- `education`: Education level (e.g., "Bachelors", "Masters")  
- `salary`: Income category ("<=50K" or ">50K")

# 2. Common Python Operations

### Common Python Operations  
In this section, we will perform some common Python operations using the dataset's columns. These operations are essential when transforming data, filtering datasets, or performing calculations during data analysis.  

The following operations will be demonstrated:
1. **Arithmetic Operations:** Addition, subtraction, multiplication, division, exponentiation, modulus, and floor division  
2. **Boolean Comparisons:** Greater than, less than, equal to, and not equal to  
3. **Assignment Shortcuts:** Combining operations with assignments (e.g., `+=`, `-=`, `*=`)  

---

### **Detailed Table of Common Operations in Python**

| **Name**            | **Symbol** | **Description**                                      | **Example (using columns from dataset)**            |
| ------------------- | ---------- | ---------------------------------------------------- | --------------------------------------------------- |
| **Addition**        | `+`        | Adds two values of the same type                     | `df['age'][0] + 5` → Adds 5 years to the first age  |
| **Subtraction**     | `-`        | Subtracts one value from another                     | `df['capital-gain'][0] - 1000` → Subtracts 1000 from capital gain |
| **Multiplication**  | `*`        | Multiplies two values                                | `df['hours-per-week'][0] * 2` → Doubles the hours worked |
| **Division**        | `/`        | Performs float division (returns a float result)     | `df['age'][1] / 2` → Divides age by 2               |
| **Floor Division**  | `//`       | Returns the integer result of division (rounds down) | `df['age'][1] // 2` → Integer division of age by 2  |
| **Exponentiation**  | `**`       | Raises a number to the power of another              | `2 ** 3` → `8` (2 raised to the power of 3)         |
| **Modulo**          | `%`        | Returns the remainder of a division                  | `10 % 3` → `1` (remainder when 10 is divided by 3)  |
| **Equal**           | `==`       | Compares two values for equality                     | `df['salary'][0] == '>50K'` → Checks if salary is ">50K" |
| **Not Equal**       | `!=`       | Compares two values for inequality                   | `df['education'][0] != 'Bachelors'` → Checks if education is not "Bachelors" |
| **Greater Than**    | `>`        | Checks if the left value is greater                  | `df['age'][0] > 40` → Checks if age is greater than 40 |
| **Less Than**       | `<`        | Checks if the left value is smaller                  | `df['capital-gain'][0] < 2000` → Checks if capital gain is less than 2000 |
| **Greater or Equal**| `>=`       | Checks if the left value is greater or equal         | `df['age'][0] >= 30`                                |
| **Less or Equal**   | `<=`       | Checks if the left value is smaller or equal         | `df['hours-per-week'][0] <= 40`                     |

---
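
As a small sketch of the operators in the table above, the snippet below applies division, floor division, exponentiation, modulo, and two comparisons to a plain integer. The value `39` is copied by hand from the `age` column in the first row of the preview so that the snippet runs without the dataframe; the variable names are illustrative only.

```python
# Stand-in for df['age'].iloc[0]; 39 is the first age shown in the dataset preview
age = 39

half_age = age / 2          # float division
half_age_floor = age // 2   # floor division (rounds down)
age_squared = age ** 2      # exponentiation
age_mod_10 = age % 10       # remainder after dividing by 10

print("age / 2  =", half_age)         # 19.5
print("age // 2 =", half_age_floor)   # 19
print("age ** 2 =", age_squared)      # 1521
print("age % 10 =", age_mod_10)       # 9

# Comparisons evaluate to booleans
print("age >= 30:", age >= 30)        # True
print("age != 40:", age != 40)        # True
```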

### **Explanation of Assignment Shortcuts**
| **Operation**    | **Shorthand** | **Example Code**                | **Equivalent to**           |  
| ---------------- | ------------- | -------------------------------- | -------------------------- |
| Addition         | `+=`          | `x += 5` (adds 5 to `x`)         | `x = x + 5`                 |
| Subtraction      | `-=`          | `y -= 3` (subtracts 3 from `y`)  | `y = y - 3`                 |
| Multiplication   | `*=`          | `z *= 2` (multiplies `z` by 2)   | `z = z * 2`                 |
| Division         | `/=`          | `a /= 4` (divides `a` by 4)      | `a = a / 4`                 |
| Floor Division   | `//=`         | `b //= 2`                        | `b = b // 2`                |
| Exponentiation   | `**=`         | `c **= 3`                        | `c = c ** 3`                |
| Modulus          | `%=`          | `d %= 7`                         | `d = d % 7`                 |


In [300]:
# Display basic arithmetic operations using the 'age' column
print("First row's age:", df['age'].iloc[0])  # iloc[] selects a value by its integer position (row number in the table)

# Addition (increasing age by 5)
age_plus_5 = df['age'].iloc[0] + 5
print("Age after adding 5:", age_plus_5)

# Subtraction (decreasing age by 2)
age_minus_2 = df['age'].iloc[0] - 2
print("Age after subtracting 2:", age_minus_2)

# Multiplication (doubling the age)
double_age = df['age'].iloc[0] * 2
print("Age after multiplying by 2:", double_age)

# Boolean Comparison (is age > 30?)
print("Is the age greater than 30?", df['age'].iloc[0] > 30)

# Assignment shortcut example (not modifying DataFrame directly for safety)
age = df['age'].iloc[0]
age += 10  # Increase age by 10
print("Age after adding 10:", age)

First row's age: 39
Age after adding 5: 44
Age after subtracting 2: 37
Age after multiplying by 2: 78
Is the age greater than 30? True
Age after adding 10: 49
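
The remaining assignment shortcuts from the table can be sketched the same way. The starting value `40` is a hypothetical stand-in (it happens to match `hours-per-week` in the first preview row), and the comments track the value after each step.

```python
# Hypothetical starting value, not read from the dataframe
hours = 40

hours += 5    # 45
hours -= 3    # 42
hours *= 2    # 84
hours //= 4   # 21
hours **= 2   # 441
hours %= 100  # 41
print("After the integer shortcuts:", hours)

hours /= 2    # true division always produces a float: 20.5
print("After /= 2:", hours)
```

Note that `/=` changes the type from `int` to `float`, which can matter when the result is later used as an index or a count.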


# 3. Data Types

### 3.1. Working with Strings in Python  

In data analysis, **text columns** often contain important categorical information, such as `workclass` (e.g., "Private", "Self-employed") or `education` (e.g., "Bachelors", "Masters"). Since strings can sometimes be inconsistent (extra spaces, case sensitivity), knowing how to clean and manipulate them is essential.

Below are some **common string operations** that can help clean and transform text data:

---

#### **Key String Operations**

| **Operation**    | **Method** | **Description**                                      | **Example**                                       |
|------------------|------------|----------------------------------------------------|--------------------------------------------------|
| **Remove Spaces**| `strip()`  | Removes leading and trailing whitespace             | `"  Data ".strip() → "Data"`                     |
| **To Uppercase** | `upper()`  | Converts all characters in the string to uppercase  | `"text".upper() → "TEXT"`                        |
| **To Lowercase** | `lower()`  | Converts all characters to lowercase                | `"TEXT".lower() → "text"`                        |
| **Splitting Text**| `split()` | Splits a string into a list of words based on spaces| `"Python Data".split() → ["Python", "Data"]`     |
| **Joining Words**| `join()`   | Joins elements of a list into a single string        | `" ".join(["Python", "Data"]) → "Python Data"`   |
| **Replacing Text**| `replace()`| Replaces a substring with another substring         | `"Python Data".replace("Data", "Science") → "Python Science"` |


In [322]:
# Example of string operations using the 'workclass' column
workclass_example = df['workclass'].iloc[2]  # Get the third row's workclass (position 2)
print("Original workclass:", workclass_example)

# Remove whitespace
print("Workclass after stripping spaces:", workclass_example.strip())

# Convert to lowercase
print("Workclass in lowercase:", workclass_example.lower())

# Replace substring
print("Replacing 'Private' with 'Self-employed':", workclass_example.replace('Private', 'Self-employed'))

Original workclass: Private
Workclass after stripping spaces: Private
Workclass in lowercase: private
Replacing 'Private' with 'Self-employed': Self-employed
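
The methods from the table that the cell above does not use (`split()`, `join()`, `upper()`) can be chained with the cleaning steps. This is a minimal sketch on a hypothetical raw value with stray whitespace and mixed case; it does not read from the dataframe.

```python
# Hypothetical raw value with stray whitespace and inconsistent case
raw_value = "  self-EMP-not-inc "

cleaned = raw_value.strip().lower()   # remove outer whitespace, then lowercase
parts = cleaned.split("-")            # split on the hyphen into a list of words
rejoined = " ".join(parts)            # join the words back with single spaces
shouted = rejoined.upper()            # convert to uppercase

print(cleaned)    # self-emp-not-inc
print(parts)      # ['self', 'emp', 'not', 'inc']
print(rejoined)   # self emp not inc
print(shouted)    # SELF EMP NOT INC
```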


### 3.2 Lists in Python

Lists in Python are used to store ordered collections of values. Lists can hold multiple data types, but in data analysis, they typically store **sequences of related data** (e.g., categories or a list of values from a column).

---

#### **Key Characteristics of Lists:**
- **Ordered:** The order of elements in a list is maintained.
- **Mutable:** You can modify lists by adding, removing, or updating elements.
- **Indexable:** You can access specific elements using their index.

---
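
The three characteristics above can be illustrated with a short hypothetical list. The values are education levels that appear in the dataset preview, typed in by hand so the snippet does not depend on the dataframe.

```python
# Hypothetical short list of education levels
levels = ["Bachelors", "HS-grad", "11th"]

# Indexable: positions start at 0, negative indexes count from the end
print(levels[0])     # Bachelors
print(levels[-1])    # 11th

# Mutable: elements can be updated, added, and removed in place
levels[1] = "Masters"
levels.append("Doctorate")
levels.remove("11th")

# Ordered: the remaining elements keep their relative order
print(levels)        # ['Bachelors', 'Masters', 'Doctorate']
print(len(levels))   # 3
```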

In the context of the **Adult Income Dataset**, the `education` column stores the education level of each individual (e.g., "Bachelors", "Masters"). You can store the **unique education levels** as a list to get an overview of all categories in this column.

---


In [325]:
# Convert the 'education' column to a list
education_list = df['education'].tolist() 

# Unique education levels
unique_education_levels = list(set(education_list))
print("Unique education levels:", unique_education_levels)

# Adding a new education level for illustration (dummy example)
education_list.append("Diploma")
print("Updated education list:", education_list[-5:])  # Display last 5 elements

Unique education levels: ['Assoc-voc', 'Doctorate', '5th-6th', 'Masters', 'HS-grad', 'Bachelors', '9th', 'Preschool', '1st-4th', 'Some-college', '7th-8th', '11th', 'Assoc-acdm', 'Prof-school', '12th', '10th']
Updated education list: ['HS-grad', 'HS-grad', 'HS-grad', 'HS-grad', 'Diploma']


### 3.3 Sets in Python 

A **set** in Python is a collection of **unique, unordered elements**. Sets automatically remove duplicates and are useful when working with **categorical columns** to identify **unique values**.

---

### **Why Use Sets in Data Analysis?**
- To find unique categories (e.g., distinct races, education levels).
- To check for duplicates or count distinct values.
- To perform set operations (e.g., intersection, union, difference) to compare data categories.

---

### **Key Set Operations**

| **Operation**      | **Syntax**           | **Description**                                   | **Example**                          |
|------------------- |--------------------- |--------------------------------------------------|-------------------------------------- |
| **Create a set**   | `set()`               | Converts a list or column into a set (unique values)| `set(df['race'])`                     |
| **Add to a set**   | `set.add(item)`       | Adds a new element to the set                    | `unique_races.add("Asian")`           |
| **Remove from set**| `set.remove(item)`    | Removes a specific item from the set              | `unique_races.remove("White")`        |
| **Check membership**| `item in set`        | Checks if an item exists in the set               | `"Black" in unique_races` → `True`    |
| **Union**          | `set1.union(set2)`    | Combines all unique elements from both sets       | `set1.union(set2)`                    |
| **Intersection**   | `set1.intersection(set2)`| Returns elements that exist in both sets        | `set1.intersection(set2)`             |


In [308]:
# Example of set usage for unique workclasses
unique_workclasses = set(df['workclass'].dropna())
print("Unique workclasses:", unique_workclasses)

# Intersection and Union example
set1 = {"Private", "Self-employed"}
set2 = {"Government", "Self-employed"}
print("Intersection:", set1.intersection(unique_workclasses))
print("Union:", set1.union(unique_workclasses))

Unique workclasses: {'State-gov', 'Private', 'Never-worked', 'Local-gov', '?', 'Self-emp-not-inc', 'Federal-gov', 'Without-pay', 'Self-emp-inc'}
Intersection: {'Private'}
Union: {'State-gov', 'Private', 'Self-emp-inc', 'Self-employed', 'Never-worked', 'Local-gov', '?', 'Self-emp-not-inc', 'Federal-gov', 'Without-pay'}
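
For comparing two groups of categories directly, a sketch along these lines may help. The sets `survey_a` and `survey_b` are hypothetical; their labels are a subset of the workclass values printed above. The example also shows `difference()` and `discard()`, which do not appear in the table.

```python
# Hypothetical sets of workclass labels
survey_a = {"Private", "State-gov", "Local-gov"}
survey_b = {"Private", "Federal-gov"}

only_in_a = survey_a.difference(survey_b)   # in survey_a but not in survey_b
shared = survey_a.intersection(survey_b)    # in both sets
combined = survey_a.union(survey_b)         # in either set

survey_b.add("Private")            # adding an existing element changes nothing
survey_b.discard("Never-worked")   # discard() ignores missing items; remove() would raise KeyError

# Sets are unordered, so sorted() is used to get a stable display order
print(sorted(only_in_a))           # ['Local-gov', 'State-gov']
print(sorted(shared))              # ['Private']
print(sorted(combined))            # ['Federal-gov', 'Local-gov', 'Private', 'State-gov']
print("Federal-gov" in survey_b)   # True
```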


### 3.4 Dictionaries in Python

Dictionaries store **data as key-value pairs**.  
- **Keys**: The identifiers used to access values (must be unique and immutable).  
- **Values**: The data associated with the keys (can be of any data type).  
- Dictionaries are useful for organizing related data in a structured format.  
For example, you can store details of a single individual from the dataset in a dictionary, where **each key corresponds to a column name**, and **each value corresponds to the respective data point** for that individual.

---

### **Key Features of Dictionaries**

| **Operation**   | **Syntax**                    | **Description**                                              | **Example**                                 |
|-----------------|--------------------------------|--------------------------------------------------------------|---------------------------------------------|
| **Access Value**| `dict[key]`                    | Accesses the value associated with a key                     | `person['age']` → Returns the age           |
| **Add/Update**  | `dict[key] = value`            | Adds a new key-value pair or updates an existing value        | `person['salary'] = '>50K'`                 |
| **Remove Key**  | `dict.pop(key)`                | Removes a key-value pair and returns the value                | `person.pop('salary')`                      |
| **Get Keys**    | `dict.keys()`                  | Returns a view of all keys in the dictionary                  | `person.keys()`                             |
| **Get Values**  | `dict.values()`                | Returns a view of all values in the dictionary                | `person.values()`                           |
| **Get Items**   | `dict.items()`                 | Returns a view of (key, value) pairs                          | `person.items()`                            |


In [310]:
# Example: Create a dictionary of the first row's data
first_person = df.iloc[0].to_dict()
print("Details of the first person:", first_person)

# Access value by key
print("Education level of first person:", first_person["education"])

# Add a new key-value pair
first_person["city"] = "Turku"
print("Updated dictionary:", first_person)

Details of the first person: {'age': 39, 'workclass': 'State-gov', 'fnlwgt': 77516, 'education': 'Bachelors', 'education-num': 13, 'marital-status': 'Never-married', 'occupation': 'Adm-clerical', 'relationship': 'Not-in-family', 'race': 'White', 'sex': 'Male', 'capital-gain': 2174, 'capital-loss': 0, 'hours-per-week': 40, 'native-country': 'United-States', 'salary': '<=50K'}
Education level of first person: Bachelors
Updated dictionary: {'age': 39, 'workclass': 'State-gov', 'fnlwgt': 77516, 'education': 'Bachelors', 'education-num': 13, 'marital-status': 'Never-married', 'occupation': 'Adm-clerical', 'relationship': 'Not-in-family', 'race': 'White', 'sex': 'Male', 'capital-gain': 2174, 'capital-loss': 0, 'hours-per-week': 40, 'native-country': 'United-States', 'salary': '<=50K', 'city': 'Turku'}
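
The other operations from the table (`pop()`, `keys()`, `items()`) can be sketched on a smaller record. The dictionary `person` below is a hypothetical, hand-typed subset of the first preview row; `get()` is included as a common way to read a key that may be missing, although it is not listed in the table.

```python
# Hypothetical record with a few columns from the first preview row
person = {"age": 39, "education": "Bachelors", "salary": "<=50K"}

person["age"] = 40                                  # update an existing key
removed = person.pop("salary")                      # remove a key and get its value back
country = person.get("native-country", "unknown")   # get() returns a default instead of raising KeyError

# items() yields (key, value) pairs in insertion order
for key, value in person.items():
    print(f"{key}: {value}")

print(list(person.keys()))   # ['age', 'education']
print(removed, country)      # <=50K unknown
```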


# 4. Control Flows

### **4.1 Conditional Statements in Python**

Conditional statements are used to control the flow of a program by making **decisions based on conditions**.  
A **condition** is an expression that evaluates to either **True** or **False**. Depending on the result, different parts of the code may or may not be executed.

In data analysis, conditional statements are commonly used to:
- Filter rows based on specific conditions (e.g., select only individuals with income `>50K`)
- Apply category labels (e.g., "Young", "Middle-aged", "Senior") based on numerical data (like `age`)
- Perform different actions for different data values (e.g., check if someone is working overtime or not)

---
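
Row filtering, the first use case listed above, is usually written in pandas with a boolean condition on a column rather than with `if`. The sketch below uses a small hand-made dataframe named `people`; the column names follow the dataset, but the rows are illustrative values, not real records.

```python
import pandas as pd

# Small hand-made frame; column names follow the dataset, the values are illustrative
people = pd.DataFrame({
    "age": [39, 50, 28, 61],
    "hours-per-week": [40, 13, 45, 60],
    "salary": ["<=50K", "<=50K", ">50K", ">50K"],
})

# A comparison on a column yields one True/False per row; indexing with it keeps the True rows
high_earners = people[people["salary"] == ">50K"]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
overtime_high_earners = people[(people["salary"] == ">50K") & (people["hours-per-week"] > 40)]

print(high_earners)
print("Rows matching both conditions:", len(overtime_high_earners))
```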

### **Key Conditional Keywords**

| **Keyword** | **Description**                                                                 | **Example**                         |
|-------------|----------------------------------------------------------------------------------|--------------------------------------|
| **`if`**    | Executes the block of code if the condition is `True`                            | `if age > 40:`                       |
| **`elif`**  | Stands for "else if"; adds another condition to check if the first is `False`    | `elif age == 30:`                    |
| **`else`**  | Executes the block of code if none of the above conditions are `True`            | `else:`                              |

---

### **Flowchart of Conditional Statements**

To visualize the decision-making process, here’s a flowchart of Python conditionals:

<div style="text-align: center;">
  <img src="https://hands-on.cloud/wp-content/uploads/2021/06/1.-Conditionals-in-Python-if-statement-flow-diagram-746x1024.png" alt="Conditional Statements Flowchart" width="300">
</div>

---

### **Explanation of Flowchart**  
- The program starts at the top and checks the `if` condition.
- If the condition is `True`, the corresponding block of code runs.
- If the condition is `False`, it moves to the `elif` condition (if present) and checks again.
- If all conditions are `False`, the `else` block (if present) executes as a fallback.

---

In [312]:
# Example of filtering based on conditions
age = df['age'].iloc[0]
if age < 18:
    print("The individual is a minor.")
elif age < 65:
    print("The individual is an adult.")
else:
    print("The individual is a senior citizen.")

The individual is an adult.
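
Conditions can also be combined with `and`, `or`, and `not`. The following sketch assigns an age label and then checks two conditions at once. The values of `age` and `hours_per_week` are hypothetical, and the cut-offs 30, 60, and 40 are assumptions chosen only for illustration.

```python
# Hypothetical values; the cut-offs below are illustrative assumptions
age = 39
hours_per_week = 45

if age < 30:
    age_group = "Young"
elif age < 60:
    age_group = "Middle-aged"
else:
    age_group = "Senior"

# Both sub-conditions must be True for the first branch to run
if age_group == "Middle-aged" and hours_per_week > 40:
    note = "Middle-aged and working more than 40 hours"
else:
    note = "No note"

print(age_group)   # Middle-aged
print(note)        # Middle-aged and working more than 40 hours
```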


### 4.2 For Loops

A **for loop** is used to **iterate over a sequence** (such as lists, tuples, strings, or columns in a dataframe).  

For loops allow us to:
- Perform repetitive operations on multiple elements (e.g., iterating over all rows in a dataframe column).
- Access specific items in a dataset (e.g., print the education level of the first 10 individuals).
- Perform batch actions (e.g., calculate sums or apply transformations to selected rows).

---

### **Key Syntax for `for` Loops**

| **Component**   | **Description**                                   |
|-----------------|---------------------------------------------------|
| **`element`**   | The variable used to store each item during iteration |
| **`iterable`**  | The object you want to loop over (e.g., list, dataframe column) |
| **Code Block**  | The code that runs for each item (indented under the `for` statement) |

### **Flowchart of a For Loop Statement**

To visualize the process of a for loop, here’s a flowchart illustrating how it works:

<div style="text-align: center;">
  <img src="https://pythonguides.com/wp-content/uploads/2023/08/For-loop-flowchart-in-Python.jpeg" alt="For Loop Flowchart" width="300">
</div>

---

### **Explanation of Flowchart**  
- The loop starts with the first element of the sequence.
- If there are more items in the sequence, the loop continues; otherwise, it stops.
- The indented block of code runs for the current item.
- The iterator moves to the next item and repeats the process.

---


In [314]:
# Example: Iterate through the first 5 rows of the 'occupation' column
print("Occupations of the first 5 individuals:")
for occupation in df['occupation'].head(5):
    print(occupation)

Occupations of the first 5 individuals:
Adm-clerical
Exec-managerial
Handlers-cleaners
Handlers-cleaners
Prof-specialty
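
A loop can also accumulate a result across iterations, such as a running sum. The sketch below uses a hypothetical list of weekly hours for five people and `enumerate()` to get each item's position alongside its value.

```python
# Hours per week for five hypothetical individuals
hours = [40, 13, 40, 40, 45]

total_hours = 0
for position, value in enumerate(hours):   # enumerate() yields (index, item) pairs
    total_hours += value                   # add the current value to the running sum
    print(f"Person {position}: {value} hours")

average_hours = total_hours / len(hours)
print("Total:", total_hours)       # 178
print("Average:", average_hours)   # 35.6
```

For whole dataframe columns, built-in pandas methods such as `df['hours-per-week'].sum()` are normally preferred over an explicit loop; the loop is shown here to make the mechanics visible.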


# 5. Functions


**Functions** in Python are reusable blocks of code that perform specific tasks.  
Functions allow us to:
- Reuse code for repeated tasks (e.g., checking work hours for each person in the dataset).
- Improve code readability by breaking down logic into smaller, meaningful parts.
- Perform operations based on input values and return specific results.

---

### **Key Components of a Function**

| **Component**   | **Description**                                           | **Example**                |
|-----------------|-----------------------------------------------------------|----------------------------|
| **Function Name**| The name used to call the function                        | `check_overtime`            |
| **Parameters**   | Values passed to the function when calling it             | `(hours_per_week)`          |
| **Return Value** | The value that the function sends back as output          | `"Overtime"` or `"Regular"` |

**General Syntax of a Function:**

```python
def function_name(parameters):
    # Function logic
    return value
```


In [316]:
# Define a function to categorize age
def categorize_by_age(age):
    if age < 18:
        return "Minor"
    elif age < 65:
        return "Adult"
    else:
        return "Senior"

# Apply the function to the first person's age
print("Age category for the first individual:", categorize_by_age(df['age'].iloc[0]))

Age category for the first individual: Adult
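
The `check_overtime` function named in the components table could look roughly like the sketch below. The 40-hour default threshold and the `threshold` parameter are assumptions added for illustration; the table only gives the function name, its `hours_per_week` parameter, and the two return values.

```python
def check_overtime(hours_per_week, threshold=40):
    """Return "Overtime" if hours exceed the threshold, otherwise "Regular".

    The default threshold of 40 hours is an assumption made for this example.
    """
    if hours_per_week > threshold:
        return "Overtime"
    return "Regular"

print(check_overtime(40))                 # Regular
print(check_overtime(45))                 # Overtime
print(check_overtime(38, threshold=35))   # Overtime
```

Giving `threshold` a default value keeps the common call short while still allowing a different limit when needed.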


# 6. File Handling: Reading and Writing Data

In Python, **file handling** allows you to:
- **Read** data from external files (e.g., CSV, text files).
- **Write** or **save** processed data to files after transformations.
- This is useful for saving cleaned datasets, analysis results, or logs.

---
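
Besides the pandas readers and writers, plain text files such as logs can be handled with the built-in `open()` function. The sketch below writes two lines and reads them back. The file name `notes.txt` and its contents are hypothetical, and a temporary folder is used so that nothing is left on disk.

```python
import os
import tempfile

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")   # hypothetical file name

    # "w" creates the file (or overwrites it); the with-block closes it automatically
    with open(path, "w", encoding="utf-8") as file:
        file.write("rows kept: 10\n")
        file.write("columns kept: 15\n")

    # "r" opens the file for reading; iterating over the file yields one line at a time
    with open(path, "r", encoding="utf-8") as file:
        lines = [line.strip() for line in file]

print(lines)   # ['rows kept: 10', 'columns kept: 15']
```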

### **Key File Handling Methods in Pandas**

| **Operation**      | **Method**        | **Description**                                       | **Example**                                |
|------------------- |------------------ |----------------------------------------------------- |--------------------------------------------|
| **Read CSV**        | `pd.read_csv()`   | Reads data from a CSV file into a dataframe           | `df = pd.read_csv('file.csv')`              |
| **Write to CSV**    | `df.to_csv()`     | Saves the dataframe as a CSV file                     | `df.to_csv('output.csv')`                   |
| **Read Excel**      | `pd.read_excel()` | Reads data from an Excel file                         | `df = pd.read_excel('file.xlsx')`           |
| **Write to Excel**  | `df.to_excel()`   | Saves the dataframe to an Excel file                  | `df.to_excel('output.xlsx')`                |
| **Read Text File**  | `open()`          | Opens a file in read or write mode                    | `with open('file.txt', 'r') as file:`       |


In [318]:
# Save a sample of the dataset to CSV
sample_df = df.head(10)  # Take the first 10 rows
sample_df.to_csv("sample_output.csv", index=False)
print("Sample data saved to 'sample_output.csv'!")

# Read the saved CSV
read_sample = pd.read_csv("sample_output.csv")
print("Read data from saved CSV:")
print(read_sample)

Sample data saved to 'sample_output.csv'!
Read data from saved CSV:
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   
5   37           Private  284582    Masters             14   
6   49           Private  160187        9th              5   
7   52  Self-emp-not-inc  209642    HS-grad              9   
8   31           Private   45781    Masters             14   
9   42           Private  159449  Bachelors             13   

          marital-status         occupation   relationship   race     sex  \
0          Never-married       Adm-clerical  Not-in-family  White    Male   
1     Married-civ-spouse    Exec-managerial        Husband  White    Male   
2               Di

# 7. Debugging in Python


**Debugging** is the process of identifying and fixing errors or unexpected behavior in your code.  
In data analysis, debugging is crucial because datasets can often contain missing values, unexpected types, or outliers that can cause errors.

One of the most common ways to handle errors gracefully is by using **`try/except`** blocks.  
This prevents the program from crashing when an error occurs and allows you to print helpful messages or apply corrective actions.

---

### **Key Components of `try/except`**

| **Component** | **Description**                                                                 |
|---------------|----------------------------------------------------------------------------------|
| **`try`**     | The block of code where errors might occur.                                      |
| **`except`**  | Handles exceptions (errors) raised in the `try` block and specifies corrective actions. |
| **`else`**    | Optional block that runs only if no exceptions occur in the `try` block.         |
| **`finally`** | Optional block that runs regardless of whether an error occurred or not.         |

---

### **Common Python Errors and Exceptions**

| **Exception**      | **Description**                                  | **Example**                           |
|--------------------|--------------------------------------------------|---------------------------------------|
| **`KeyError`**     | Raised when accessing a dictionary or dataframe column that doesn't exist | `df['nonexistent_column']`            |
| **`ValueError`**   | Raised when a function receives a value of the right type but with invalid content | `int('abc')` (this string cannot be converted to an integer) |
| **`ZeroDivisionError`** | Raised when attempting to divide by zero   | `10 / 0`                              |
| **`TypeError`**    | Raised when an operation is applied to the wrong type | `"string" + 1` (cannot add a string and an integer) |


In [320]:
# Example of try/except for handling KeyError
try:
    print("Attempting to access a non-existent column")
    print(df['nonexistent_column'])  # This column does not exist
except KeyError:
    print("Error: Column does not exist!")
finally:
    print("Debugging example completed.")

Attempting to access a non-existent column
Error: Column does not exist!
Debugging example completed.
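
The `else` block and multiple `except` clauses from the tables above can be combined in one place. The helper `safe_divide` below is a hypothetical function written for this illustration; it converts two strings to integers and divides them, handling `ValueError` and `ZeroDivisionError` separately.

```python
def safe_divide(numerator_text, denominator_text):
    """Convert two strings to integers and divide them, reporting problems instead of crashing."""
    try:
        result = int(numerator_text) / int(denominator_text)
    except ValueError:
        print("Error: both inputs must look like whole numbers.")
        return None
    except ZeroDivisionError:
        print("Error: cannot divide by zero.")
        return None
    else:
        print("Division succeeded.")   # runs only when no exception was raised
        return result
    finally:
        # runs in every case, even when a return statement was reached above
        print("Finished processing:", numerator_text, "/", denominator_text)

print(safe_divide("10", "4"))    # 2.5
print(safe_divide("abc", "4"))   # None
print(safe_divide("10", "0"))    # None
```

Catching specific exception types, rather than a bare `except:`, keeps unrelated bugs visible instead of silently hiding them.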
