#### Name: Stuti Upadhyay
#### Campus ID: XT81177
#### Instructor: Chalachew Jemberie

# Week 3 - Python Basic Skills on Real Datasets

This homework assignment focuses on foundational Python skills, using built-in functions and data structures without relying on external libraries like `pandas` or visualization tools.

## Homework 1: Analyzing Census Income Data

**Objective**: Write Python functions to load, process, and analyze a subset of the "Adult" (Census Income) dataset to extract insights using only basic Python data structures and functions.


We'll use the **"Adult" dataset** for a data analysis task that can be completed with basic Python.

#### Task Overview

1. **Data Loading and Processing**:
   - Write a function to read the dataset from a CSV file and store it in a list of lists (or a list of dictionaries, if you're comfortable with dictionaries). Each inner list/dictionary should represent a single person's information.
   
2. **Data Cleaning**:
   - Write a function to clean the dataset (e.g., replace missing values represented by "?" with `None` or a relevant placeholder).

3. **Basic Analysis**:
   - Write functions to calculate:
     - The average age of individuals in the dataset.
     - The distribution of individuals by education level (i.e., the count of individuals for each education level).

4. **Income Level Analysis**:
   - Write a function to determine the percentage of individuals making more than $50,000 a year, grouped by education level.




#### Dataset Details

- You can download a simplified version of the dataset, which includes a subset of the columns: [Download Sample Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data)

- Consider using a smaller sample of the dataset for ease of testing.

#### Sample Dataset Columns

For simplicity, use these columns: Age, Workclass, Education, Occupation, and Income.
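
The raw `adult.data` file linked above has no header row and separates fields with a comma followed by a space, so `csv.DictReader` needs explicit field names before these five columns can be selected. The following is a minimal sketch under those assumptions, using the 15-column order documented for the UCI Adult dataset; `ADULT_COLUMNS`, `KEEP`, and `load_selected_columns` are hypothetical names introduced here for illustration.

```python
import csv
import io

# Column order of the raw UCI file (assumption: the file ships without a header row).
ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
# The five columns suggested for this homework.
KEEP = ["age", "workclass", "education", "occupation", "income"]


def load_selected_columns(lines):
    """Parse headerless Adult rows and keep only the columns listed in KEEP."""
    reader = csv.DictReader(lines, fieldnames=ADULT_COLUMNS, skipinitialspace=True)
    rows = []
    for row in reader:
        # A truncated line has fewer than 15 fields, so its last column is None.
        if row["income"] is None:
            continue
        rows.append({name: row[name] for name in KEEP})
    return rows


# A sample line in the file's format, read from memory instead of from disk.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)
sample_rows = load_selected_columns(sample)
print(sample_rows)
```

With a file on disk, the same function can be given an open file object, for example `open("adult.data", newline="")`. A CSV file that already has a header row, such as the `adult.csv` used in the script below, does not need the `fieldnames` argument.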



### Data Loading

In [17]:
import csv

# Function to read the dataset from a CSV file and store it in a list of dictionaries
def read_dataset(filename):
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        dataset = [row for row in reader]
    return dataset

# Function to clean the dataset
def clean_dataset(dataset):
    for row in dataset:
        # Replace missing values represented by "?" with None
        for key, value in row.items():
            if value == "?":
                row[key] = None
    return dataset

# Function to calculate the average age of individuals in the dataset
def calculate_average_age(dataset):
    ages = [int(row['age']) for row in dataset if row['age'] is not None]
    if len(ages) == 0:
        return None
    return sum(ages) / len(ages)

# Function to calculate the distribution of individuals by education level
def calculate_education_distribution(dataset):
    education_levels = {}
    for row in dataset:
        education_level = row['education']
        if education_level not in education_levels:
            education_levels[education_level] = 0
        education_levels[education_level] += 1
    return education_levels

# Function to determine the percentage of individuals making more than $50,000 a year, grouped by education level
def calculate_income_percentage(dataset):
    # Count, in a single pass, all individuals and the high earners per education level
    total_count_by_education = {}
    high_income_count_by_education = {}
    for row in dataset:
        education_level = row['education']
        total_count_by_education[education_level] = total_count_by_education.get(education_level, 0) + 1
        if row['income'] == '>50K':
            high_income_count_by_education[education_level] = high_income_count_by_education.get(education_level, 0) + 1
    
    # Convert the counts to percentages; levels with no high earners are omitted
    income_percentage_by_education = {}
    for education_level, high_income_count in high_income_count_by_education.items():
        percentage = (high_income_count / total_count_by_education[education_level]) * 100
        income_percentage_by_education[education_level] = percentage
    
    return income_percentage_by_education

In [19]:
# Main function
def main():
    filename = 'adult.csv'
    dataset = read_dataset(filename)
    cleaned_dataset = clean_dataset(dataset)
    
    # Basic Analysis
    average_age = calculate_average_age(cleaned_dataset)
    education_distribution = calculate_education_distribution(cleaned_dataset)
    
    print("Average Age:", average_age)
    print("Education Distribution:")
    for education, count in education_distribution.items():
        print(f"{education}: {count}")
    
    # Income Level Analysis
    income_percentage = calculate_income_percentage(cleaned_dataset)
    print("\nIncome Percentage by Education Level:")
    for education, percentage in income_percentage.items():
        print(f"{education}: {percentage}%")

if __name__ == "__main__":
    main()

Average Age: 38.64358543876172
Education Distribution:
11th: 1812
HS-grad: 15784
Assoc-acdm: 1601
Some-college: 10878
10th: 1389
Prof-school: 834
7th-8th: 955
Bachelors: 8025
Masters: 2657
Doctorate: 594
5th-6th: 509
Assoc-voc: 2061
9th: 756
12th: 657
1st-4th: 247
Preschool: 83

Income Percentage by Education Level:
Assoc-acdm: 25.79637726420987%
Some-college: 18.964883250597538%
Prof-school: 73.98081534772182%
HS-grad: 15.857830714647744%
Masters: 54.91155438464433%
Doctorate: 72.55892255892256%
Bachelors: 41.283489096573206%
Assoc-voc: 25.327510917030565%
9th: 5.423280423280423%
10th: 6.263498920086392%
7th-8th: 6.492146596858639%
11th: 5.077262693156733%
5th-6th: 5.304518664047151%
1st-4th: 3.2388663967611335%
12th: 7.30593607305936%
Preschool: 1.2048192771084338%


## Additional Task

- Review the provided Python script, understand its functionality, and run it with a dataset of your choice that matches the expected format.

- Analysis: Identify any potential issues or limitations with the script that could arise with different datasets.

- Documentation: For each function you modify or add, include a docstring explaining its purpose, inputs, and outputs. Additionally, comment on your reasoning for any significant changes or decisions you made in your code.
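
One limitation that could arise with different datasets: the script compares raw strings, so a value such as `" ?"` with a leading space, or an income label written as `">50K."` (the form used in the UCI `adult.test` file), would not match `"?"` or `">50K"`. The sketch below shows one possible normalization step that could run before the analysis functions; `normalize_row` is a hypothetical helper, not part of the original script.

```python
def normalize_row(row):
    """Return a cleaned copy of one record.

    Args:
    row (dict): Maps column names to raw string values.

    Returns:
    dict: Same keys, with whitespace stripped, "?" and empty strings replaced
    by None, and a trailing period removed from the income label.
    """
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip()
            if value in ("", "?"):
                value = None
        cleaned[key] = value
    income = cleaned.get("income")
    if income is not None:
        # adult.test writes the labels as "<=50K." and ">50K."
        cleaned["income"] = income.rstrip(".")
    return cleaned


raw = {"age": " 25", "workclass": " ?", "education": " 11th", "income": " >50K."}
print(normalize_row(raw))
```

Applying this to every row, for example with `[normalize_row(row) for row in dataset]`, would let the existing comparisons against `"?"` and `">50K"` work on both file variants.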

## Homework 2: Text Analysis with "The Adventures of Sherlock Holmes"

**Objective**: Write Python functions to load and analyze text data from "The Adventures of Sherlock Holmes" by Arthur Conan Doyle. The analysis will include counting words, finding unique words, and identifying the most common words.


### Dataset Details

Text Source: Project Gutenberg's "The Adventures of Sherlock Holmes"

Direct Download URL: https://www.gutenberg.org/files/1661/1661-0.txt

You can download the text file manually or download it programmatically from within your Python script. If you go the programmatic route, consider using the `requests` library, and remember to respect the website's robots.txt and usage policies.
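
One way to respect the site's usage policies is to download the file once and reuse the local copy on later runs. The sketch below illustrates that idea with the standard library's `urllib.request` instead of `requests`, purely to stay dependency-free; `fetch_url`, `load_or_download`, and the `fetch` parameter are hypothetical names, and the UTF-8 encoding is an assumption about the file.

```python
import os
import urllib.request


def fetch_url(url):
    """Download url and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def load_or_download(url, path, fetch=fetch_url):
    """Return the text stored at path, downloading it from url first if needed.

    Args:
    url (str): Where to download the text from when no local copy exists.
    path (str): Location of the local copy.
    fetch (callable): Takes a URL and returns bytes; replaceable for testing.

    Returns:
    str: The decoded text.
    """
    if not os.path.exists(path):
        data = fetch(url)
        with open(path, "wb") as f:
            f.write(data)
    # utf-8-sig drops a leading byte-order mark if the file has one
    with open(path, encoding="utf-8-sig") as f:
        return f.read()
```

A call such as `load_or_download(url, "sherlock.txt")` would hit the network only on the first run; the file name is an arbitrary choice.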

### Task Overview

**Data Loading:**

- Write a function to read the entire text of "The Adventures of Sherlock Holmes" into a single string.

**Data Cleaning and Preprocessing:**

- Write a function to remove punctuation and make all words lowercase to ensure accurate word counting.

**Word Frequency Analysis:**

- Write a function to count how often each word appears in the text and identify the top 10 most frequent words.

**Unique Words:**
- Write a function to find all unique words in the text.

**Explain Each Step:**

- For each function written, include a docstring that explains what the function does, its inputs, and its outputs. Additionally, add comments throughout your code to explain each step of your logic.

In [3]:
import string
import requests
from collections import Counter
from bs4 import BeautifulSoup

#### Data Loading 

In [12]:
def load_text(url):
    """
    Load the text of "The Adventures of Sherlock Holmes" from the provided URL.
    
    Args:
    url (str): The URL from which to download the text.
    
    Returns:
    str: The entire text of "The Adventures of Sherlock Holmes" as a single string.
    """
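    # Download the content from the URL and fail early on an HTTP error
    response = requests.get(url)
    response.raise_for_status()
    
    # The URL serves plain text rather than HTML; BeautifulSoup is used here
    # to decode the response bytes and return them as a string
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Get all text from the response
    text = soup.text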
    
    return text
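
The string returned by `load_text` contains everything in the file, which for Project Gutenberg texts typically includes a license header and footer in addition to the book itself. Those words are counted along with the stories. The sketch below shows one way to keep only the body, assuming the file marks the body with lines beginning `*** START OF` and `*** END OF`; both the marker format and the helper name `strip_gutenberg_boilerplate` are assumptions, so check them against the downloaded file.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Args:
    text (str): The full contents of the downloaded file.

    Returns:
    str: The body between the markers, or the input unchanged if either
    marker is missing.
    """
    lines = text.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        if line.startswith("*** START OF") and start is None:
            start = i
        elif line.startswith("*** END OF"):
            end = i
            break
    if start is None or end is None or end <= start:
        return text
    return "\n".join(lines[start + 1:end]).strip()


demo = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body text of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
print(strip_gutenberg_boilerplate(demo))
```

The word counts reported at the end of this notebook were produced without this step, so applying it would change them.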

#### 1) Data Cleaning and Preprocessing

In [7]:
def clean_text(text):
    """
    Clean the text by removing punctuation and converting all words to lowercase.
    
    Args:
    text (str): The text to clean.
    
    Returns:
    str: The cleaned text.
    """
    # Remove punctuation
    cleaned_text = text.translate(str.maketrans("", "", string.punctuation))
    
    # Convert all words to lowercase
    cleaned_text = cleaned_text.lower()
    
    return cleaned_text
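
`string.punctuation` contains only ASCII characters, so typographic quotation marks such as “ and ”, if the downloaded text uses them, survive `clean_text` and stay attached to the neighboring words. The sketch below is one possible variant that deletes every character whose Unicode general category starts with `P` (punctuation); `PUNCTUATION_TABLE` and `clean_text_unicode` are hypothetical names. Unlike `string.punctuation`, this table leaves symbols such as `$` and `+` in place, because Unicode classifies them as symbols rather than punctuation.

```python
import sys
import unicodedata

# Translation table that deletes every Unicode punctuation character
# (general categories starting with "P"), not just the ASCII ones.
PUNCTUATION_TABLE = {
    codepoint: None
    for codepoint in range(sys.maxunicode + 1)
    if unicodedata.category(chr(codepoint)).startswith("P")
}


def clean_text_unicode(text):
    """Delete all Unicode punctuation from text and convert it to lowercase.

    Args:
    text (str): The text to clean.

    Returns:
    str: The cleaned text.
    """
    return text.translate(PUNCTUATION_TABLE).lower()


print(clean_text_unicode("“Holmes,” said he, “you amaze me!”").split())
```

As with `clean_text`, deleting a hyphen or apostrophe joins the characters on either side of it, so "well-known" becomes "wellknown". The results reported at the end of this notebook were produced with the ASCII-only version.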

#### 2) Word Frequency Analysis
This function counts the frequency of each word in the text and identifies the top 10 most common words.

In [9]:
def word_frequency(text):
    """
    Count how often each word appears in the text and identify the top 10 most frequent words.
    
    Args:
    text (str): The text to analyze.
    
    Returns:
    list: A list of tuples containing the top 10 most frequent words and their frequencies.
    """
    # Split the text into words
    words = text.split()
    
    # Count the frequency of each word
    word_counts = Counter(words)
    
    # Get the top 10 most frequent words
    top_10_words = word_counts.most_common(10)
    
    return top_10_words

#### 3) Unique Words
This function finds all unique words in the cleaned text.

In [10]:
def unique_words(text):
    """
    Find all unique words in the text.
    
    Args:
    text (str): The text to analyze.
    
    Returns:
    set: A set containing all unique words in the text.
    """
    # Split the text into words
    words = text.split()
    
    # Get the unique words using set
    unique_words_set = set(words)
    
    return unique_words_set


#### The main function to perform all the tasks

In [13]:
# Main function to perform the analysis
def main():
    # URL of the text
    url = "https://www.gutenberg.org/files/1661/1661-0.txt"
    
    # Load the text
    text = load_text(url)
    
    # Clean the text
    cleaned_text = clean_text(text)
    
    # Word frequency analysis
    top_10_words = word_frequency(cleaned_text)
    print("Top 10 most frequent words:")
    for word, count in top_10_words:
        print(f"{word}: {count}")
    
    # Unique words
    unique_words_set = unique_words(cleaned_text)
    print("\nNumber of unique words:", len(unique_words_set))

if __name__ == "__main__":
    main()

Top 10 most frequent words:
the: 5717
and: 2936
of: 2765
to: 2735
a: 2652
i: 2600
in: 1789
that: 1655
it: 1477
he: 1417

Number of unique words: 10398
