<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_10-Regression_I/Data_Science_Fiction_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Fiction I - Part 1

Your Name

## The Prompt

**Structured Prompt:**

"You are a science fiction storyteller. Create a short story (approximately 3-4 paragraphs) based on the following dataset, which describes conditions within a fictional system or environment. The story should incorporate the following variables and their potential meanings in a fictional context:

* **feature_1:** [Describe a fictional phenomenon that could be represented by this numerical feature. For example, 'temporal energy fluctuations', 'biological mutation rate', 'psychic resonance levels', etc.]
* **feature_2:** [Describe a fictional phenomenon that could be represented by this numerical feature. For example, 'atmospheric distortion index', 'neural network feedback', 'reality shift magnitude', etc.]
* **feature_3:** [Describe a fictional phenomenon that could be represented by this numerical feature. For example, 'technological anomaly frequency', 'genetic drift coefficient', 'dimensional rift occurrence', etc.]
* **feature_4:** [Describe a fictional phenomenon that could be represented by this numerical feature. For example, 'cybernetic integration level', 'mental stability index', 'quantum entanglement variance', etc.]
* **feature_5:** [Describe a fictional phenomenon that could be represented by this numerical feature. For example, 'environmental decay rate', 'sentient AI activity', 'subspace communication interference', etc.]
* **target_variable:** [Describe a fictional outcome that could be represented by this numerical target variable. For example, 'population stability', 'resource production efficiency', 'anomaly containment success', etc.]
* **categorical_1:** [Describe a categorical variable, like 'species type', 'technological class', 'environmental zone', etc.]
* **categorical_2:** [Describe a categorical variable, like 'social hierarchy', 'communication protocol', 'dimensional access level', etc.]
* **name_1:** [Use these as character first names.]
* **name_2:** [Use these as character last names.]
* **zip_code:** [Use these as location identifiers or zone codes.]
* **date:** [Use these as dates of important events or observations.]
* **location:** [Use these as the original location or region of a character or object.]

The story should include a central mystery or problem related to the `target_variable` and how it is influenced by the other variables. Make it clear that some of the data is missing, and that this makes the problem more difficult to solve. The story should also imply that a linear regression model could be used to better understand the relationship between the target variable and the features.

Please ensure the story is engaging and uses the generated character names and location identifiers naturally. The story should include at least 3 characters.

Example of how to write the story.
In the year 3042, the inhabitants of the biodome known as sector 7 were facing a crisis. The population stability was rapidly declining, and no one knew why. Dr. Anya Sharma, using the data collected, believed that the temporal energy fluctuations were the root cause. While other sectors were thriving, sector 7 struggled to survive. She wanted to use a linear regression model to prove her theory, but the data was incomplete.

Please create the story."

**Tips for Writing Structured Prompts:**

1.  **Define the Role:**
    * Start by clearly defining the AI's role (e.g., "You are a science fiction storyteller"). This sets the context and guides the AI's response.
2.  **Specify the Task:**
    * Clearly state what you want the AI to do (e.g., "Create a short story...").
3.  **Provide Context:**
    * Give the AI any necessary background information (e.g., "based on the following dataset...").
4.  **List Variables and Their Meanings:**
    * Explicitly list each variable and provide a potential fictional meaning. This helps the AI understand how to incorporate the data into the story.
5.  **Set Constraints:**
    * Specify any constraints, such as the length of the story or specific elements to include (e.g., "approximately 3-4 paragraphs").
6.  **Provide Examples:**
    * Providing a short example of how to incorporate the data into the story can be very helpful.
7.  **Use Clear Formatting:**
    * Use bullet points or numbered lists to organize the information and make the prompt easier to read.
8.  **Be Specific:**
    * The more specific you are, the better the AI's output will be.
9.  **Iterate and Refine:**
    * Don't be afraid to experiment with different phrasings and structures. You may need to refine your prompt based on the AI's initial responses.
10. **Include Example Data:**
    * Adding a few example data points can further improve the AI's understanding of the data.

A structured prompt guides the AI toward a story that is both creative and tied to your dataset, which makes the later analysis easier to motivate.
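The tips above can be sketched as a small helper that assembles a prompt from its parts (role, task, variable meanings, constraints, optional example). The function name `build_prompt` and its parameters are hypothetical and not part of this notebook; treat it as one possible way to keep the prompt consistent while you iterate on it.

```python
# Hypothetical helper (not part of the notebook): assembles a structured prompt.
def build_prompt(role, task, variables, constraints, example=None):
    # Tips 1-3: role, task, and context come first
    lines = [f"You are {role}. {task}", ""]
    # Tip 4: one bullet per variable with its fictional meaning
    for name, meaning in variables.items():
        lines.append(f"* **{name}:** {meaning}")
    lines.append("")
    # Tip 5: constraints, one per line
    lines.extend(constraints)
    # Tip 6: optional short example
    if example:
        lines += ["", "Example of how to write the story.", example]
    lines += ["", "Please create the story."]
    return "\n".join(lines)

# Example meanings taken from the suggestions in the prompt template
variables = {
    "feature_1": "temporal energy fluctuations",
    "target_variable": "population stability",
}
prompt = build_prompt(
    role="a science fiction storyteller",
    task="Create a short story (approximately 3-4 paragraphs) based on the following dataset.",
    variables=variables,
    constraints=["The story should include at least 3 characters."],
)
print(prompt)
```

Keeping the variable meanings in a dictionary makes tip 9 (iterate and refine) cheaper: you change one entry and regenerate the prompt rather than editing the text by hand.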


## Create the Data

In [None]:
%pip install Faker

In [None]:
import numpy as np
import pandas as pd
import random
from sklearn.datasets import make_regression
from faker import Faker

fake = Faker()

# Build 100 rows of fake demographic data
output = []
for _ in range(100):
    sex = np.random.choice(['egg', 'seed'], p=[0.5, 0.5])
    output.append({
        'categorical_1': sex,
        'categorical_2': np.random.choice(['A', 'B', 'C']),
        'name_1': fake.first_name_female() if sex == 'egg' else fake.first_name_male(),
        'name_2': fake.last_name(),
        'zip_code': fake.zipcode(),
        'date': fake.date_of_birth(),
        'location': fake.state_abbr()
    })

demographics = pd.DataFrame(output)

def make_null(r, w):
    """Return NaN with probability w percent, otherwise return r unchanged."""
    if random.randint(0, 99) < w:
        return np.nan
    else:
        return r

# Generating features for linear regression with generic names
features, target = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

generic_cols = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
# random.shuffle(generic_cols)
df = pd.DataFrame(data=features, columns=generic_cols)
df['target_variable'] = target

# Add derived columns (a squared term and an interaction term).
# They are computed before any values are blanked, so they stay complete
# even where feature_1 or feature_2 is later set to NaN.
df['feature_1_squared'] = df['feature_1']**2
df['interaction_1_2'] = df['feature_1'] * df['feature_2']

# Add extra Gaussian noise to the target and make feature_4 non-negative
df['target_variable'] = df['target_variable'] + np.random.normal(0, 5, 100)
df['feature_4'] = df['feature_4'].abs()

# Add missing values
for col in generic_cols:
    df[col] = df[col].apply(make_null, args=(10,))

df = pd.concat([df, demographics], axis=1)

print(df.shape)
print(df.info())
df.head()

(100, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   feature_1          88 non-null     float64
 1   feature_4          93 non-null     float64
 2   feature_2          90 non-null     float64
 3   feature_3          94 non-null     float64
 4   feature_5          90 non-null     float64
 5   target_variable    100 non-null    float64
 6   feature_1_squared  100 non-null    float64
 7   interaction_1_2    100 non-null    float64
 8   categorical_1      100 non-null    object 
 9   categorical_2      100 non-null    object 
 10  name_1             100 non-null    object 
 11  name_2             100 non-null    object 
 12  zip_code           100 non-null    object 
 13  date               100 non-null    object 
 14  location           100 non-null    object 
dtypes: float64(8), object(7)
memory usage: 11.8+ KB
None


Unnamed: 0,feature_1,feature_4,feature_2,feature_3,feature_5,target_variable,feature_1_squared,interaction_1_2,categorical_1,categorical_2,name_1,name_2,zip_code,date,location
0,,0.677162,-0.012247,-0.897254,0.075805,-76.106337,0.950858,-0.011942,seed,B,John,Hart,5069,1984-01-25,GA
1,0.081874,0.485364,0.758969,-0.772825,,-60.157642,0.006703,0.06214,egg,C,Lindsay,Kelly,77965,1963-02-15,DC
2,-1.412304,0.908024,,-1.012831,0.314247,-264.612304,1.994602,0.794121,seed,B,Jeffrey,Ray,19119,1999-12-20,MH
3,-0.64512,0.361636,1.35624,-0.07201,1.003533,110.98138,0.416179,-0.874937,seed,B,Dalton,King,76616,1985-12-22,PA
4,-0.6227,0.280992,-1.952088,-0.151785,0.588317,-124.691544,0.387755,1.215564,egg,C,Megan,Banks,73073,1928-08-18,WV
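
The `make_null` helper above blanks each value independently with probability `w` percent, so the non-null counts in the `info()` output vary from column to column and from run to run. A minimal sketch of how the share of missing values per column can be checked, using a stand-in DataFrame (`demo` is hypothetical, not the notebook's `df`) and a seed added only so the sketch is repeatable:

```python
import random

import numpy as np
import pandas as pd

random.seed(0)  # assumption: seeded only to make this sketch repeatable

def make_null(r, w):
    """Return NaN with probability w percent, otherwise return r unchanged."""
    return np.nan if random.randint(0, 99) < w else r

# Stand-in data (hypothetical), larger than the notebook's 100 rows
demo = pd.DataFrame({
    "feature_1": np.arange(1000, dtype=float),
    "feature_2": np.arange(1000, dtype=float),
})

# Blank about 10% of feature_1 and leave feature_2 untouched
demo["feature_1"] = demo["feature_1"].apply(make_null, args=(10,))

# isna() gives a boolean frame; the column mean is the missing share
missing_share = demo.isna().mean()
print(missing_share)
```

With only 100 rows, as in the notebook, a column can easily end up with anywhere from a handful to well over ten missing values, which is consistent with the counts shown above.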


## Example Story

**The Great Martian Population Mystery**

In the year 2342, humanity has established thriving colonies across Mars. However, the Martian Central Command has noticed a perplexing trend: the population growth rate of these colonies varies wildly. They have collected data on several factors, including:

* **asteroid_impact:** The frequency and severity of nearby asteroid impacts.
* **solar_flare_intensity:** The intensity of solar flares affecting the colony.
* **alien_signal_strength:** The strength of mysterious alien signals detected.
* **temporal_anomaly_index:** A measure of temporal anomalies observed in the region.
* **cybernetic_enhancement_level:** The average level of cybernetic enhancements among the colonists.
* **colony_population_growth:** The observed population growth rate.

Additionally, they have demographic data on the colonists, including sex, brain wave patterns, names, ZIP codes, birthdates, and state of origin (from Earth).

Martian Central Command needs your help to build a linear regression model that can predict the colony population growth rate based on these factors. They suspect that some of the factors may have non-linear relationships or interact with each other. They also know that some data is missing.

**Your Task:**

1.  Analyze the provided `data_science_fiction.csv` dataset.
2.  Clean and preprocess the data, handling missing values and potential outliers.
3.  Build a linear regression model to predict `colony_population_growth`.
4.  Evaluate the model's performance and interpret the coefficients.
5.  Write a report explaining your findings and providing insights into the factors that influence Martian colony population growth.
6.  Create visualizations that help explain your findings.
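
Steps 2 through 4 of the task could be sketched roughly as follows. This is a minimal outline under stated assumptions, not a reference solution: it uses stand-in data from `make_regression` with generic column names instead of reading `data_science_fiction.csv`, median imputation is just one of several reasonable choices, and the names `data`, `clean`, and `coefs` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in data (assumption): generic features, not the real CSV
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
cols = [f"feature_{i}" for i in range(1, 6)]
data = pd.DataFrame(X, columns=cols)
data["target_variable"] = y

# Blank roughly 10% of the feature values, then append 8 duplicate rows
rng = np.random.default_rng(0)
mask = rng.random(data[cols].shape) < 0.10
data[cols] = data[cols].mask(mask)
data = pd.concat([data, data.iloc[:8]], ignore_index=True)

# Step 2: clean the data by dropping exact duplicate rows
clean = data.drop_duplicates().reset_index(drop=True)

# Step 3: impute missing values inside a pipeline, then fit the model
X_train, X_test, y_train, y_test = train_test_split(
    clean[cols], clean["target_variable"], test_size=0.25, random_state=0
)
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_train, y_train)

# Step 4: evaluate on held-out rows and inspect the coefficients
r2 = model.score(X_test, y_test)
coefs = pd.Series(model[-1].coef_, index=cols)
print(f"R^2 on held-out rows: {r2:.3f}")
print(coefs)
```

Putting the imputer inside the pipeline means the medians are learned from the training rows only, so the held-out score is not influenced by the test data.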

## Get Creative

In [None]:
def fake_colony_name(): # only relevant to example story
    """Generates a fake colony name with a sci-fi feel."""
    name_formats = [
        "Colony " + fake.city_suffix() + " " + fake.word().capitalize(),
        fake.word().capitalize() + " " + fake.word().capitalize() + " Outpost",
        "Sector " + str(fake.random_int(min=1, max=100)) + " " + fake.word().capitalize(),
        fake.word().capitalize() + " " + "Station " + str(fake.random_int(min=1, max=50)),
        "Terra " + fake.word().capitalize(),
        fake.word().capitalize() + "-" + fake.word().capitalize() + " Base",
        fake.word().capitalize() + " " + "Settlement"
    ]
    return random.choice(name_formats)

def add_colony_names_to_dataframe(df, num_colonies): # only relevant to example story
    """Adds a 'colony_name' column to a DataFrame; num_colonies must equal len(df)."""
    colony_names = [fake_colony_name() for _ in range(num_colonies)]
    df['colony_name'] = colony_names
    return df

# Add colony names to the DataFrame
df = add_colony_names_to_dataframe(df, len(df)) # only relevant to example story

## Rename Columns to Fit Story

In [None]:
def rename_columns(df, name_mapping):
    """
    Renames columns in a DataFrame based on a provided mapping.

    Args:
        df (pd.DataFrame): The DataFrame to rename columns in.
        name_mapping (dict): A dictionary where keys are generic column names
                              and values are the desired new column names.

    Returns:
        pd.DataFrame: The DataFrame with renamed columns.
    """
    return df.rename(columns=name_mapping)

# Example Usage (Students would create their own name_mapping)
example_name_mapping = {
    'feature_1': 'asteroid_impact',
    'feature_2': 'solar_flare_intensity',
    'feature_3': 'alien_signal_strength',
    'feature_4': 'temporal_anomaly_index',
    'feature_5': 'cybernetic_enhancement_level',
    'target_variable': 'colony_population_growth',
    'feature_1_squared' : 'asteroid_impact_squared',
    'interaction_1_2' : 'solar_flare_interaction',
    'categorical_1' : 'sex',
    'categorical_2' : 'brain_wave',
    'name_1': 'given_name',
    'name_2': 'surname',
    'zip_code': 'zipcode',
    'date': 'date_of_birth',
    'location': 'state_of_origin'
}

# Add extra missing values (about 5%) to one demographic column and one feature
df['categorical_1'] = df['categorical_1'].apply(make_null, args=(5,))
df['feature_5'] = df['feature_5'].apply(make_null, args=(5,))

# Append duplicates of the first 8 rows (.loc slicing is inclusive), then shuffle
dupes = df.loc[0:7]
df = pd.concat([df, dupes], axis=0)
df = df.sample(frac=1).reset_index(drop=True)

df = rename_columns(df, example_name_mapping)

# The default index=True also writes the row index as an unnamed first column
df.to_csv('data_science_fiction.csv')

print(df.shape)
print(df.info())
df.head()

(108, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108 entries, 0 to 107
Data columns (total 16 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   asteroid_impact               94 non-null     float64
 1   temporal_anomaly_index        101 non-null    float64
 2   solar_flare_intensity         97 non-null     float64
 3   alien_signal_strength         101 non-null    float64
 4   cybernetic_enhancement_level  93 non-null     float64
 5   colony_population_growth      108 non-null    float64
 6   asteroid_impact_squared       108 non-null    float64
 7   solar_flare_interaction       108 non-null    float64
 8   sex                           99 non-null     object 
 9   brain_wave                    108 non-null    object 
 10  given_name                    108 non-null    object 
 11  surname                       108 non-null    object 
 12  zipcode                       108 non-null    object 


Unnamed: 0,asteroid_impact,temporal_anomaly_index,solar_flare_intensity,alien_signal_strength,cybernetic_enhancement_level,colony_population_growth,asteroid_impact_squared,solar_flare_interaction,sex,brain_wave,given_name,surname,zipcode,date_of_birth,state_of_origin,colony_name
0,0.822545,1.057711,-0.601707,1.852278,-0.013497,4.888436,0.67658,-0.494931,egg,C,Melissa,Gomez,4284,1974-11-08,GA,Director Station 44
1,-1.200296,,-0.792521,-0.114736,,-31.008129,1.440711,0.95126,egg,A,Michelle,Schwartz,84907,1931-12-27,WI,Attorney Settlement
2,0.196861,1.328186,-1.220844,0.208864,-1.95967,-260.892213,0.038754,-0.240337,seed,B,Bobby,Michael,21407,1988-06-20,DE,Magazine Raise Outpost
3,0.357015,0.849602,-0.208122,,,38.469749,0.12746,-0.074303,seed,C,Gordon,Smith,37365,1940-11-04,FM,Sea Settlement
4,1.644968,1.366874,-0.275052,-2.301921,-1.515191,41.005408,2.705919,-0.452451,egg,B,Kara,Cardenas,68165,1943-08-28,TN,Enjoy Step Outpost


# Analysis - Part 2