# Analyzing Jeopardy Data - Data Cleaning and Insights

#### Summary:
This project explores the "Jeopardy" dataset to analyze questions and answers, focusing on filtering data based on keywords, cleaning monetary values, and identifying unique answers. The analysis aims to demonstrate data cleaning, transformation, and statistical computation skills.

#### Workflow:
 1. Load and investigate the dataset.
 2. Rename columns to fix formatting issues.
 3. Filter questions based on specific keywords.
 4. Clean and convert monetary values into a usable format.
 5. Identify unique answers and calculate average values for filtered data.
 6. Wrap the filtering and manipulation steps in reusable functions where required.

We import the pandas library.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

*Note*: the `display.max_colwidth` setting above tells pandas to display the full content of each cell, regardless of how long the text is. This is helpful when working with text-heavy columns where you want to see the entire content without truncation.
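If the full-width display is only needed for a few outputs, the same option can be applied temporarily. The following is a minimal sketch, assuming a small made-up DataFrame (`sample` is hypothetical and not part of the project data); `pd.option_context` restores the previous setting when the block ends.

```python
import pandas as pd

# Hypothetical one-row frame with a long text cell, for illustration only
sample = pd.DataFrame({"Clue": ["word " * 30]})

before = pd.get_option("display.max_colwidth")

# Inside the block, cells are displayed without truncation
with pd.option_context("display.max_colwidth", None):
    inside = pd.get_option("display.max_colwidth")
    print(sample)

# After the block, the earlier setting is back in place
after = pd.get_option("display.max_colwidth")
```

This keeps wide output limited to the cells that need it, whereas `pd.set_option` changes the display for the whole session.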

We then load the data from a CSV file, print the column names, and use head() to inspect the first 8 rows.

In [2]:
jeopardy_data = pd.read_csv("jeopardy.csv")
print("\nThe dataset has the following columns:")
print(jeopardy_data.columns)


The dataset has the following columns:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
display(jeopardy_data.head(8))

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record average of 4,055 hours of sunshine each year | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", this company served its billionth burger | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States | John Adams |
| 5 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $200 | In the title of an Aesop fable, this insect shared billing with a grasshopper | the ant |
| 6 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $400 | Built in 312 B.C. to link Rome & the South of Italy, it's still in use today | the Appian Way |
| 7 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $400 | No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls | Michael Jordan |


##### Several column names start with a stray space, so we rename them to make the content of each column clearer

In [4]:
jeopardy_data = jeopardy_data.rename(columns={
    " Air Date": "Broadcast Date",
    " Round": "Phase",
    " Category": "Topic",
    " Value": "Prize",
    " Question": "Clue",
    " Answer": "Solution"
})
print("\nRenamed Columns now are:")
print(jeopardy_data.columns)


Renamed Columns now are:
Index(['Show Number', 'Broadcast Date', 'Phase', 'Topic', 'Prize', 'Clue',
       'Solution'],
      dtype='object')
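When the only goal is to remove the leading spaces and keep the original names, the headers can be stripped in one step instead of being renamed one by one. This is a minimal sketch on a hypothetical empty frame (`demo`) that reuses a few of the original header names; it is an alternative to the rename above, not the approach used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical frame with the same kind of header problem as the raw file
demo = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Value"])

# Strip leading and trailing whitespace from every column name
demo.columns = demo.columns.str.strip()

cleaned_columns = list(demo.columns)
print(cleaned_columns)
```

An explicit `rename` is still the better choice when the names themselves should change, as with "Prize" and "Clue" here.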


##### Explore the data type of each column

In [5]:
print(jeopardy_data.dtypes)

Show Number        int64
Broadcast Date    object
Phase             object
Topic             object
Prize             object
Clue              object
Solution          object
dtype: object


We see that the "Prize" column has the object type, which is not appropriate because it holds numerical values. Let's convert it to float.

In [6]:
def clean_prize(value):
    try:
        # Attempt to clean and convert the value
        if pd.notnull(value) and value.startswith("$"):
            return float(value[1:].replace(",", ""))
        else:
            return 0
    except (ValueError, AttributeError):
        # Handle any unexpected formats gracefully
        return 0

jeopardy_data["Prize"] = jeopardy_data["Prize"].apply(clean_prize)

Let's check whether the conversion worked properly.

In [7]:
print(jeopardy_data.dtypes)

Show Number         int64
Broadcast Date     object
Phase              object
Topic              object
Prize             float64
Clue               object
Solution           object
dtype: object


###### And yes, the Prize column is now a float type (float64)
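One design choice worth noting: `clean_prize` maps missing or unparseable values to 0, and those zeros then count toward any average. The sketch below shows a possible alternative that keeps such entries as NaN so they are skipped. The `raw` series is made up for illustration, and the presence of entries such as `"None"` in the real file is an assumption.

```python
import pandas as pd

# Hypothetical raw prize strings, including one unparseable and one missing entry
raw = pd.Series(["$200", "$1,000", "None", None, "$400"])

# Remove "$" and thousands separators, then coerce anything unparseable to NaN
prize = pd.to_numeric(
    raw.str.replace("$", "", regex=False).str.replace(",", "", regex=False),
    errors="coerce",
)

mean_with_nan = prize.mean()             # NaN entries are ignored
mean_with_zero = prize.fillna(0).mean()  # zeros lower the average

print(mean_with_nan, mean_with_zero)
```

Which behaviour is preferable depends on whether a clue without a dollar value should count as a prize of 0 or as "no value recorded".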

##### Preview the clues

In [8]:
print("\nSample Clues:")
print(jeopardy_data["Clue"].head())


Sample Clues:
0               For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory
1    No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves
2                       The city of Yuma in this state has a record average of 4,055 hours of sunshine each year
3                           In 1963, live on "The Art Linkletter Show", this company served its billionth burger
4       Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States
Name: Clue, dtype: object


##### Filtering a dataset by a list of words

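We use a lambda function inside a regular function defined with def for this. The match is a case-insensitive substring test, and a clue is kept only if it contains every word in the list.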

In [9]:
def filter_data(data, words):
    """
    Filters the dataset for clues containing all specified words.
    
    Parameters:
    data (DataFrame): The dataset to filter.
    words (list): List of words to filter clues by.

    Returns:
    DataFrame: Filtered dataset.
    """
    filter_func = lambda x: all(word.lower() in x.lower() for word in words)
    return data.loc[data["Clue"].apply(filter_func)]

##### We can now test our filter function and see how it works

In [10]:
filtered = filter_data(jeopardy_data, ["King", "England"])
print("\nFiltered Clues containing 'King' and 'England' words:")
print(filtered["Clue"].head())


Filtered Clues containing 'King' and 'England' words:
4953                   Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"
6337     In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man
9191                   This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt
11710              This Scotsman, the first Stuart king of England, was called "The Wisest Fool in Christendom"
13454                                      It's the number that followed the last king of England named William
Name: Clue, dtype: object
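Because `filter_data` tests for substrings, the word "king" is also found inside longer words such as "Viking". If only whole words should match, one option is a regular expression with word boundaries. The following is a rough sketch: `filter_whole_words` and the two sample clues are hypothetical, and it assumes every entry in the column is a string.

```python
import re

import pandas as pd


def filter_whole_words(data, words, column="Clue"):
    """Keep rows whose text contains every word as a whole word (case-insensitive)."""
    patterns = [
        re.compile(rf"\b{re.escape(word.strip())}\b", re.IGNORECASE)
        for word in words
    ]
    mask = data[column].apply(
        lambda text: all(pattern.search(text) for pattern in patterns)
    )
    return data.loc[mask]


# Hypothetical clues for illustration only
clues = pd.DataFrame({"Clue": [
    "This king of England won at Agincourt",
    "Viking raiders reached England in 793",
]})

whole_word_hits = filter_whole_words(clues, ["King", "England"])
print(whole_word_hits)
```

Only the first clue is kept, since "king" in the second one appears solely as part of "Viking".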


In [11]:
filtered = filter_data(jeopardy_data, ["Aesop ", "fable"])
print("\nFiltered Clues containing 'fable' and 'Aesop' words:")
print(filtered["Clue"])


Filtered Clues containing 'fable' and 'Aesop' words:
5                         In the title of an Aesop fable, this insect shared billing with a grasshopper
7154            In an Aesop fable, this animal decides the grapes he can't reach must therefore be sour
8753                                In the Aesop fable, he's so far ahead he takes a nap; what a loser!
26859                     In the title of an Aesop fable, this insect shared billing with a grasshopper
39731                        In an Aesop fable, Fox makes this bird drop its food by asking it to speak
89930               In an Aesop fable, these insects laugh at a hungry cicada who goofed off all summer
126430               "It is best to prepare for the days of necessity" is the moral of this Aesop fable
154723     In an Aesop fable a shepherd boy tests the patience of his village by repeatedly crying this
160246               An ant shares the billing with one of these insects in the title of an Aesop fable
194178    

##### Identifying unique solutions

In [12]:
def get_solution_counts(data):
    """
    Counts the frequency of unique solutions in the dataset.
    
    Parameters:
    data (DataFrame): The dataset to analyze.

    Returns:
    Series: Frequency of unique solutions.
    """
    return data["Solution"].value_counts()

# Testing count function
print("\nUnique Solution Counts:")
print(get_solution_counts(filtered).head())


Unique Solution Counts:
Solution
the ant    1
a fox      1
Hare       1
Ant        1
Crow       1
Name: count, dtype: int64
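In the counts above, "the ant" and "Ant" are treated as different solutions because `value_counts` compares the raw strings. A possible refinement is to normalize the text before counting. The sketch below lowercases each solution and drops a leading article; the `solutions` series simply repeats the five values shown above, and the choice of articles to remove is an assumption.

```python
import pandas as pd

# Sample solutions copied from the output above
solutions = pd.Series(["the ant", "a fox", "Hare", "Ant", "Crow"])

# Lowercase, remove a leading "the", "a" or "an", and trim whitespace
normalized = (
    solutions.str.lower()
    .str.replace(r"^(the|a|an)\s+", "", regex=True)
    .str.strip()
)

normalized_counts = normalized.value_counts()
print(normalized_counts)
```

With this normalization "ant" is counted twice, while the other solutions still appear once each.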
