# STA130 Week 01 Homework 

Please see the course [wiki-textbook](https://github.com/pointOfive/stat130chat130/wiki) for the list of topics covered in this homework assignment, and a list of topics that might appear during ChatBot conversations which are "out of scope" for the purposes of this homework assignment (and hence can be safely ignored if encountered).


<details class="details-example"><summary style="color:blue"><u>Introduction</u></summary>

### Introduction
    
A reasonable characterization of STA130 Homework is that it simply defines a weekly reading comprehension assignment. 
Indeed, STA130 Homework essentially boils down to completing various understanding confirmation exercises oriented around coding and writing tasks.
However, rather than reading a textbook, STA130 Homework is based on ChatBots so students can interactively follow up to clarify questions or confusion that they may still have regarding learning objective assignments.

> Communication is a fundamental skill underlying statistics and data science, so STA130 Homework based on ChatBots helps practice effective two-way communication as part of a "realistic" dialogue activity supporting underlying conceptual understanding building. 

It will likely become increasingly tempting to rely on ChatBots to "do the work for you". But when you find yourself frustrated with a ChatBot's inability to give you the results you're looking for, this is a "hint" that you've become overreliant on ChatBots. Your objective should not be to have ChatBots "do the work for you", but to use ChatBots to help you build your understanding so you can leverage ChatBots (and other resources) to work more efficiently.<br><br>

</details>

<details class="details-example"><summary style="color:blue"><u>Instructions</u></summary>

### Instructions
    
1. Code and write all your answers (for both the "Prelecture" and "Postlecture" HW) in a python notebook (in code and markdown cells) 
    
> It is *suggested but not mandatory* that you complete the "Prelecture" HW prior to the Monday LEC since (a) all HW is due at the same time; but, (b) completing some of the HW early will mean better readiness for LEC and less of a "procrastination crunch" towards the end of the week...
    
2. Paste summaries of your ChatBot sessions (including link(s) to chat log histories if you're using ChatGPT) within your notebook
    
> Create summaries of your ChatBot sessions by using concluding prompts such as "Please provide a summary of our exchanges here so I can submit them as a record of our interactions as part of a homework assignment" or, "Please provide me with the final working version of the code that we created together"
    
3. Save your python jupyter notebook in your own account and "repo" on [github.com](github.com) and submit a link to that notebook through Quercus for assignment marking<br><br>

</details>

<details class="details-example"><summary style="color:blue"><u>Prompt Engineering?</u></summary>

### Prompt Engineering?    
    
The questions (as copy-pasted prompts) are designed to initialize appropriate ChatBot conversations which can be explored in the manner of an interactive and dynamic textbook; but, it is nonetheless **strongly recommended** that you rephrase the questions in a way that you find natural to ensure a clear understanding of the question. Given sensible prompts that represent a question well, the two primary challenges observed to arise from ChatBots are 

1. conversations going beyond the intended scope of the material addressed by the question; and, 
2. unrecoverable confusion as a result of sequential layers of logical inquiry that cannot be resolved. 

In the case of the former (1), adding constraints specifying the limits of considerations of interest tends to be helpful; whereas, the latter (2) is often the result of initial prompting that leads to poor developments in navigating the material, which are likely best resolved by a "hard reset" with a new initial approach to prompting.  Indeed, this is exactly the behavior [hardcoded into copilot](https://answers.microsoft.com/en-us/bing/forum/all/is-this-even-normal/0b6dcab3-7d6c-4373-8efe-d74158af3c00)...

</details>

# ChatBot Session Summaries
1. https://chatgpt.com/share/df3f6b7f-cac7-4f7d-9c31-37b638a1ec27
2. https://chatgpt.com/share/dd384e3e-8e7b-4b9e-89d4-8d116ea05b8d
3. https://chatgpt.com/share/bad80848-9ae2-495e-acb6-350a3d848bee
4. https://chatgpt.com/share/66e382e9-0750-800e-bf38-22e62151c201





## Summary for 1
Here's a summary of our exchanges that you can submit as part of your homework assignment:

---

### Summary of ChatGPT Interaction on Pandas, Dataframes, and Titanic Dataset

**1. Loading the Titanic Dataset:**
- We loaded the Titanic dataset using Pandas with the command `pd.read_csv()`.
- The dataset was stored in a DataFrame named `df` for further analysis.

**2. Determining the Shape of the Dataset:**
- We used `df.shape` to find the number of rows and columns in the dataset. This command provides a tuple in the form `(number_of_rows, number_of_columns)`, giving an overview of the dataset’s size.

**3. Generating Simple Summaries of Columns:**
- You wanted to generate summary statistics, and we discussed how to use `df.describe()` to get descriptive statistics for numeric columns. To include both numeric and categorical columns, we used `df.describe(include='all')`.

**4. Understanding Discrepancies Between `df.shape` and `df.describe()`:**
- We explored why the number of rows and columns reported by `df.shape` might differ from the statistics in `df.describe()`. The two key reasons are:
  - **Missing Data**: The "count" in `df.describe()` reflects only non-null values, so columns with missing data will show lower counts than the total number of rows.
  - **Column Types**: `df.describe()` by default only provides summaries for numeric columns. It ignores categorical columns unless specified with `include='all'`.

**5. Value Counts for Categorical Data:**
- You asked how to provide a summary of categorical columns, and I introduced the method `df['column'].value_counts()`. This provides the frequency of unique values in categorical columns like `sex`, `pclass`, or `embarked`.
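
As a minimal sketch of the `value_counts()` usage described above, the following uses a small made-up DataFrame (the `toy` name and its values are illustrative assumptions, not the Titanic data). It also shows the `dropna=False` option, which was not part of the original exchange but is useful for seeing how many entries are missing.

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column such as `embarked`
toy = pd.DataFrame({"embarked": ["S", "C", "S", None, "Q", "S"]})

# Frequency of each category; missing values are excluded by default
counts = toy["embarked"].value_counts()
print(counts)

# Pass dropna=False to also count the missing entries
counts_with_missing = toy["embarked"].value_counts(dropna=False)
print(counts_with_missing)
```

The two totals differ by exactly the number of missing entries in the column.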

**6. Difference Between Attributes and Methods:**
- We discussed the difference between **attributes** and **methods** in Pandas:
  - **Attributes** (e.g., `df.shape`, `df.columns`) provide information about the DataFrame and do not require parentheses because they don't perform actions.
  - **Methods** (e.g., `df.describe()`, `df.head()`) perform actions on the DataFrame and require parentheses to execute the function.

**7. Downloading Files from GitHub:**
- I explained how to download raw data files from GitHub by using the "Raw" button, obtaining the direct URL, and using it with `pd.read_csv()` to load the dataset.

**8. Definitions of Observations and Variables:**
- **Observations** refer to individual data points or rows in a dataset, where each row corresponds to one instance of data (e.g., a passenger in the Titanic dataset).
- **Variables** refer to attributes or characteristics of the data, represented by columns in a dataset (e.g., age, fare, class, etc.).

---

This summary captures the key points of our interactions. Let me know if you need further adjustments!





## Summary for 2
Here is a summary of our interactions regarding handling missing data and data cleaning in the Titanic dataset, which you can submit as part of your homework assignment:

---

### Summary of Interactions

**1. Understanding Summary Statistics in `df.describe()`:**
   - We discussed key summary statistics such as:
     - **Count**: Number of non-null entries in each column.
     - **Mean**: Average value of the column.
     - **Std**: Standard deviation, representing the dispersion of values.
     - **Min**: Minimum value in the column.
     - **25% (First Quartile)**: Value below which 25% of the data falls.
     - **50% (Median)**: Middle value of the dataset.
     - **75% (Third Quartile)**: Value below which 75% of the data falls.
     - **Max**: Maximum value in the column.

---

**2. Use Case for `df.dropna()` vs. `del df['col']`:**

- **Using `df.dropna()`**:
  - **Use Case**: When you want to handle missing values in key columns without removing entire columns. For instance, dropping rows with missing values in important columns like `age` or `fare` in the Titanic dataset while retaining relevant data for analysis.
  - **Example**: Removing rows with missing data in the `age` column to analyze the survival rate by age while keeping other columns intact.

- **Using `del df['col']`**:
  - **Use Case**: When there are columns irrelevant to your analysis or model, you can remove them to simplify the dataset.
  - **Example**: Removing columns like `name`, `ticket`, and `cabin` that may not be useful for predictive analysis in the Titanic dataset.

---

**3. Opposite Use Case for `del df['col']`:**
   - **Removing Columns Before Handling Missing Values**:
     - Removing unnecessary columns before handling missing data can improve efficiency, especially if the irrelevant columns contain missing values that could otherwise lead to unnecessary data loss.
     - **Example**: Removing `name`, `ticket`, and `cabin` columns before using `df.dropna()` to avoid dropping rows based on irrelevant columns.

---

**4. Efficient Data Cleaning Approach for the Titanic Dataset**:

1. **Load the Data**:
   ```python
   import pandas as pd
   url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
   df = pd.read_csv(url)
   ```

2. **Initial Report on Missing Data**:
   ```python
   df.isna().sum()
   ```

3. **Remove Irrelevant Columns**:
   ```python
   del df['name']
   del df['ticket']
   del df['cabin']
   ```

4. **Handle Missing Values with `df.dropna()`**:
   ```python
   df_cleaned = df.dropna()
   ```

5. **Final Report**: After cleaning the dataset, you can check again for missing data.
   ```python
   df_cleaned.isna().sum()
   ```

**Justification**:
- This approach ensures the efficient removal of irrelevant columns, which reduces the data size and allows for a focused cleaning process by only handling missing values in the relevant columns.

---

This summary outlines the steps, use cases, and justifications we discussed for handling missing data in the Titanic dataset, making it suitable for submission as a record of our interactions for your homework assignment.





## Summary for 3
Here’s a summary of our interactions that you can use for your homework assignment:

---

### Summary of Interactions:

1. **Dataset Overview**: 
   - You uploaded the Titanic dataset and asked for an example of when to use `df.dropna()` versus deleting a column using `del df['col']`. We explored scenarios where each would be appropriate.
     - **Use of `df.dropna()`**: Preferable when you want to remove rows with missing values in specific columns (e.g., removing rows where the `deck` column has `NaN` values but keeping other columns intact).
     - **Use of `del df['col']`**: Preferable when you want to completely remove an irrelevant column, especially if the column has too many missing values (e.g., if the `deck` column has too many `NaN`s, and the data isn’t needed for analysis).

2. **Opposite Use Case Example**:
   - We discussed an opposite scenario where `del df['col']` is preferred over `df.dropna()`. For instance, if the `deck` column is largely irrelevant and full of missing data, it’s better to delete the entire column to retain all rows. Removing rows with missing `deck` values using `df.dropna()` could result in significant data loss.

3. **Order of Operations**: 
   - We discussed why it’s important to apply `del df['col']` before `df.dropna()`. Deleting irrelevant columns first ensures that rows aren’t mistakenly dropped because of missing values in those columns. This preserves as much of the dataset as possible.

4. **KeyError Issue**:
   - You encountered a `KeyError` when referencing or deleting a column that likely did not exist. We identified that it might be due to a typo, case sensitivity, or the column simply not being present. I suggested checking column names before attempting to delete them.

5. **Removing Missing Data from the Dataset**:
   - You requested a strategy to remove all missing data from the dataset using both `del df['col']` and `df.dropna()`.
     - **Approach**: 
       1. Delete irrelevant or mostly missing columns (e.g., `deck`).
       2. Use `df.dropna()` to remove rows that contain missing values in the remaining columns.
     - **Justification**: This combination minimizes data loss while ensuring the dataset is clean and relevant for analysis.

6. **Before and After Report**:
   - I provided a Python code snippet that you can use to compare the dataset before and after cleaning, reporting the number of missing values in each column and the shape of the dataset before and after applying the cleaning steps.
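
The snippet from the chat is not reproduced in this summary, so the following is only a rough sketch of what such a "before and after" report could look like. It uses a small made-up DataFrame (`df_toy` and its values are assumptions for illustration) in which `deck` is mostly missing, mirroring the Titanic situation discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `deck` is mostly missing, `age` has one missing value
df_toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 51.86],
    "deck": [np.nan, "C", np.nan, np.nan],
})

# "Before" report: overall shape and missing values per column
print("Before:", df_toy.shape)
print(df_toy.isna().sum())

# For comparison: dropping rows first keeps only rows complete in every column
rows_if_dropna_first = df_toy.dropna().shape[0]
print("Rows left if dropna() is applied first:", rows_if_dropna_first)

# Delete the mostly-missing column first, then drop the remaining incomplete rows
df_clean = df_toy.copy()
del df_clean["deck"]
df_clean = df_clean.dropna()

# "After" report
print("After:", df_clean.shape)
print(df_clean.isna().sum())
```

In this toy case, calling `dropna()` before deleting `deck` would discard every row, while deleting `deck` first loses only the single row with a missing `age`.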

---

This summary should provide a clear record of the key points we covered and the approach to cleaning your dataset. Let me know if you need any additional details!





## Summary for 4
Sure! Here's a summary of our interactions:

---

### Summary of Interactions

1. **Importing the Titanic Dataset**:
   - **User Action**: Attempted to load the Titanic dataset using `pandas`.
   - **Code**: `import pandas as pd` and `url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"`, followed by `df = pd.read_csv(url)`.
   - **Issue**: Encountered issues with the URL at first but corrected it to the right one.

2. **Using `groupby` and `describe`**:
   - **User Action**: Attempted to group the DataFrame by columns (like `sex`) and compute descriptive statistics.
   - **Code**: `df.groupby("sex")["age"].describe()`.
   - **Issue**: Encountered errors due to misspelling `groupby` as `group_by` and forgetting to use quotes around the column names.

3. **Fixing NameErrors**:
   - **User Action**: Faced multiple `NameError`s, such as referencing column names like `sex` and `age` without quotes.
   - **Code Correction**: The correct way is to reference column names using quotes, e.g., `df.groupby("sex")["age"].describe()`.
   - **Resolution**: Errors were resolved by ensuring all column names were enclosed in quotes and spelled correctly.

4. **Handling Case Sensitivity in Column Names**:
   - **User Action**: Encountered `KeyError` due to incorrect capitalization of column names (`Sex` instead of `sex`, and `Age` instead of `age`).
   - **Code**: `df.groupby("sex")["age"].describe()`.
   - **Issue**: The column names are case-sensitive, so the correct usage is `"sex"` and `"age"`.
   - **Resolution**: Ensured column names were referenced with the correct case.
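
The fixes above can be sketched on a small made-up DataFrame (`df_toy` and its values are illustrative assumptions that reuse the lowercase Titanic column names). Checking `df.columns` before referencing a column is one simple way to catch spelling or capitalization problems early.

```python
import pandas as pd

# Hypothetical toy data using the same lowercase column names as the Titanic dataset
df_toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

# Inspect the exact column names before referencing them
print(list(df_toy.columns))

# Column labels are case-sensitive strings: "Sex" is not the same label as "sex"
has_capitalized = "Sex" in df_toy.columns
print(has_capitalized)

# Quoted, correctly cased names work as expected
summary = df_toy.groupby("sex")["age"].describe()
print(summary)
```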

---

This summary outlines the key points and issues you encountered during our interactions, as well as the solutions to those problems. If you need additional details or further clarification, feel free to ask!


### Marking Rubric (which may award partial credit) 

- [0.1 points]: All relevant ChatBot summaries [including link(s) to chat log histories if you're using ChatGPT] are reported within the notebook
- [0.2 points]: Reasonable well-written general definitions for Question "2.2"
- [0.3 points]: Demonstrated understanding regarding Question "4"
<!-- - [0.2 points]: A sensible justification for the choice in Question "7.4" -->
- [0.4 points]: Requested assessment of ChatBot versus google performance in Question "8.3"


## "Pre-lecture" HW [*completion prior to next LEC is suggested but not mandatory*]

#### 1. Pick one of the datasets from the ChatBot session(s) of the **TUT demo** (or from your own ChatBot session if you wish) and use the code produced through the ChatBot interactions to import the data and confirm that the dataset has missing values<br>

<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> If your TA has not shared a relevant ChatBot session from their **TUT demo** through a piazza post and a Quercus announcement, the **TUT notebook** has links to example ChatBot sessions that you can use; or, ...
> 
> ```python
> # feel free to just use the following if you prefer...
> import pandas as pd
> url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv"
> df = pd.read_csv(url)
> df.isna().sum()
> ```
    
</details>

In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv"
df = pd.read_csv(url)
df.isna().sum()

row_n           0
id              1
name            0
gender          0
species         0
birthday        0
personality     0
song           11
phrase          0
full_id         0
url             0
dtype: int64

#### 2. Start a new ChatBot session with an initial prompt introducing the dataset you're using and request help to determine how many columns and rows of data a `pandas` DataFrame has, and then

1. use code provided in your ChatBot session to print out the number of rows and columns of the dataset; and,  
2. write your own general definitions of the meaning of "observations" and "variables" based on asking the ChatBot to explain these terms in the context of your dataset<br>

<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> A good initial prompt to start would be something like
> - "I've downloaded a dataset about characters from Animal Crossing (from https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv), and I'd like to know what columns of information I have and how much data I have"
> 
> You can further reduce the scope of your inquiry if needed with something like
> - "I've already downloaded the data and want to understand the size (or dimensions) of the dataset to start with"
> 
> *Some ChatBots can upload your data and do this for you; but, extended usage of this feature [likely requires a paid subscription](https://github.com/pointOfive/stat130chat130/blob/main/CHATLOG/wk1/GPT/SLS/00006_gpt3p5_LoadDataPaywall.md); and, anyway, you need to run the code yourself rather than having a ChatBot do that for you; and, for STA130 we don't want a ChatBot to just do the analysis for us; rather, we instead want ChatBots to help us understand the steps we need to take to analyze the data; so,* **you DO NOT need to purchase an upgraded version of any ChatBots**
> - Free-tier level ChatBots like [GPT4o-mini](https://chat.openai.com/) or [Copilot](https://copilot.microsoft.com/) (which is partially based on [ChatGPT4.0](https://chat.openai.com/), and which you have access to through your UofT account) are sufficiently sophisticated and perfectly appropriate for the STA130 course
    
</details>

In [12]:
# To get the number of rows and columns
rows, columns = df.shape
print(f"The DataFrame has {rows} rows and {columns} columns.")

The DataFrame has 891 rows and 15 columns.


An observation is a record or single data point in the dataset. In other words, observations are the rows in the dataset. Observations represent individual cases or subjects under study. In the villager dataset, each observation (row) represents an individual villager (essentially acting like a villager's individual record). For example, you could see a certain villager's birthday, gender, and more by looking at a certain row.

A variable is a characteristic or feature of the data being collected. Variables are the columns in the dataset. Each column in the dataset represents a specific type of data which is collected for each observation (in the villager dataset, each column represents a characteristic recorded for all villagers such as gender or name).

#### 3. Ask the ChatBot how you can provide simple summaries of the columns in the dataset and use the suggested code to provide these summaries for your dataset<br>

<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> Use your ChatBot session to help you create working examples of using  `df.describe()` and `df['column'].value_counts()` for your dataset (although note that the `.value_counts()` method is not really meant to be used for numeric variables, so if your dataset has only numeric variables, `.value_counts()` might not be particularly informative...)
>
> #### ChatBot Response Scope 
>     
> If prompts are not sufficiently focused you will likely get overly broad responses from the ChatBot, but you can always respond with subsequent refinement requests to appropriately limit the scope of the ChatBot responses to focus on addressing your actual content targets; so, 
> - an initially very general inquiry like, "I need help analyzing my data" will likely result in a ChatBot response suggesting a wide variety of approaches and techniques for summarizing your dataset; but, re-prompting the ChatBot with something like, "What's the simplest form of summarization of this dataset that I could do and how do I do it in Python?" or suggesting guidance using the specific summarization methods requested above will helpfully re-orient the ChatBot to your specific interests and needs
> 
> #### Jupyter Notebook Hints
> 
> Jupyter notebook printouts usually don't show all of the data (when there's too much to show, like if `df.describe()` includes results for many columns), but the printouts show enough of the data to give an idea of what the results are, which is all we're looking for at the moment
> 
> - Consider dividing the code that ChatBot provides you into different jupyter notebook cells so that each cell concludes with a key printed result; the last line of code in a jupyter notebook cell will automatically print out in a formatted manner, so replacing something like `print(df.head())` with `df.head()` as the last line of a cell provides a sensible way to organize your code
> - The printout suggestions above are demonstrated in `STA130F24_CourseProject.ipynb` if looking at an example would be helpful to understand what they're getting at...
    
</details>

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 391 entries, 0 to 390
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   row_n        391 non-null    int64 
 1   id           390 non-null    object
 2   name         391 non-null    object
 3   gender       391 non-null    object
 4   species      391 non-null    object
 5   birthday     391 non-null    object
 6   personality  391 non-null    object
 7   song         380 non-null    object
 8   phrase       391 non-null    object
 9   full_id      391 non-null    object
 10  url          391 non-null    object
dtypes: int64(1), object(10)
memory usage: 33.7+ KB


In [12]:
df.shape

(391, 11)

In [14]:
df['species'].value_counts()

species
cat          23
rabbit       20
frog         18
squirrel     18
duck         17
dog          16
cub          16
pig          15
bear         15
mouse        15
horse        15
bird         13
penguin      13
sheep        13
elephant     11
wolf         11
ostrich      10
deer         10
eagle         9
gorilla       9
chicken       9
koala         9
goat          8
hamster       8
kangaroo      8
monkey        8
anteater      7
hippo         7
tiger         7
alligator     7
lion          7
bull          6
rhino         6
cow           4
octopus       3
Name: count, dtype: int64

In [18]:
df['personality'].value_counts()

personality
lazy      60
normal    59
cranky    55
snooty    55
jock      55
peppy     49
smug      34
uchi      24
Name: count, dtype: int64

In [17]:
df.describe(include='all')

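| | row_n | id | name | gender | species | birthday | personality | song | phrase | full_id | url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 391.0 | 390 | 391 | 391 | 391 | 391 | 391 | 380 | 391 | 391 | 391 |
| unique | | 390 | 391 | 2 | 35 | 361 | 8 | 92 | 388 | 391 | 391 |
| top | | admiral | Admiral | male | cat | 1-27 | lazy | K.K. Country | wee one | villager-admiral | https://villagerdb.com/images/villagers/thumb/... |
| freq | | 1 | 1 | 204 | 23 | 2 | 60 | 10 | 2 | 1 | 1 |
| mean | 239.902813 | | | | | | | | | | |
| std | 140.702672 | | | | | | | | | | |
| min | 2.0 | | | | | | | | | | |
| 25% | 117.5 | | | | | | | | | | |
| 50% | 240.0 | | | | | | | | | | |
| 75% | 363.5 | | | | | | | | | | |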


#### 4. If the dataset you're using has (a) non-numeric variables and (b) missing values in numeric variables, explain (perhaps using help from a ChatBot if needed) the discrepancies between size of the dataset given by `df.shape` and what is reported by `df.describe()` with respect to (a) the number of columns it analyzes and (b) the values it reports in the "count" column<br>

<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> If the dataset you're using does not have (a) non-numeric variables and (b) missing values in numeric variables (e.g., the `"villagers.csv"` example above has only a single numeric variable `row_n` which has no missing values), instead download and use the [https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv) data to answer this question  
>
> In (a) above, the "columns it analyzes" refers to the columns of the output of `df.describe()` which will only include "numeric" columns by default, but you can see the names of all the columns in a dataset using `df.columns`; and, make sure `df.shape` is referring to the dataset you think it is... if you've loaded a different dataset it might not have been called `df`(!)
>
> **If you get any errors (for example related to column names), copy and paste them as a response to the ChatBot, and see if it can help you resolve them by adding the suggested adjustments to your code and then rerunning all your code to see if the changes have fixed the problem (and repeat this process as needed until the problems have been resolved).**
    
</details>

In [16]:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)
df.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [21]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [17]:
df.shape

(891, 15)

In [18]:
df.describe()

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [19]:
df.describe(include='all')

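| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 889 | 891 | 891 | 891 | 203 | 889 | 891 | 891 |
| unique | | | 2 | | | | | 3 | 3 | 3 | 2 | 7 | 3 | 2 | 2 |
| top | | | male | | | | | S | Third | man | True | C | Southampton | no | True |
| freq | | | 577 | | | | | 644 | 491 | 537 | 537 | 59 | 644 | 549 | 537 |
| mean | 0.383838 | 2.308642 | | 29.699118 | 0.523008 | 0.381594 | 32.204208 | | | | | | | | |
| std | 0.486592 | 0.836071 | | 14.526497 | 1.102743 | 0.806057 | 49.693429 | | | | | | | | |
| min | 0.0 | 1.0 | | 0.42 | 0.0 | 0.0 | 0.0 | | | | | | | | |
| 25% | 0.0 | 2.0 | | 20.125 | 0.0 | 0.0 | 7.9104 | | | | | | | | |
| 50% | 0.0 | 3.0 | | 28.0 | 0.0 | 0.0 | 14.4542 | | | | | | | | |
| 75% | 1.0 | 3.0 | | 38.0 | 1.0 | 0.0 | 31.0 | | | | | | | | |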


The number of columns of this dataset given by `df.shape` (which reports a dataset's number of rows and columns) is 15. On the other hand, the number of columns analyzed by `df.describe()` is only 6. This discrepancy arises because `df.describe()` by default analyzes only numeric columns, such as `age` or `fare`, while `df.shape` counts every column (including non-numeric ones). Calling `df.describe(include='all')` (which includes non-numeric columns) shows all 15 columns of the Titanic dataset. Moreover, `df.shape` reports 891 rows, including rows with null values, whereas the "count" row of `df.describe()` reports only the number of non-null values in each column. So the "count" for `age` is 714 rather than 891, since the `age` column is missing 177 values (891 - 177 = 714).
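
The same two discrepancies can be reproduced on a tiny made-up DataFrame, which makes them easy to check by hand. This is only an illustrative sketch: `df_toy` and its values are assumptions, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value, one text column
df_toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 8.05],
    "sex": ["male", "female", "female"],
})

n_rows, n_cols = df_toy.shape          # counts every row and every column
numeric_summary = df_toy.describe()    # numeric columns only, by default

print(n_rows, n_cols)
print(numeric_summary.columns.tolist())   # the text column `sex` is absent
print(numeric_summary.loc["count"])       # non-null counts per numeric column
```

Here `df_toy.shape` reports 3 columns while `describe()` summarizes only 2, and the "count" for `age` is one less than the number of rows because of the single missing value.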

#### 5. Use your ChatBot session to help understand the difference between the following and then provide your own paraphrasing summarization of that difference

- an "attribute", such as `df.shape` which does not end with `()`
- and a "method", such as `df.describe()` which does end with `()` 
   

<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> The fact that a "method" such as `df.describe()` ends with `()` suggests that "methods" are essentially something that we would call a "function" in programming language terminology; but, without getting too technical or "in the weeds", it might also be worth considering that we could also contrast what the difference is between a "function" in a programming language versus a "function" in mathematics...  
    
</details><br><br>

***Don't forget to ask for summaries of your ChatBot session(s) and paste these into your homework notebook (including link(s) to chat log histories if you're using ChatGPT)!***<br><br>

An attribute is a property of an object that is simply accessed rather than called (which is why there are no parentheses). Attributes usually hold static information about the object, such as its dimensions or column names. On the other hand, a method is a function belonging to an object and needs to be invoked (which is why there are parentheses). Unlike attributes, methods perform operations on a data frame, such as computing summary statistics.
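
A short sketch of this distinction, using a made-up one-column DataFrame (`df_toy` is an illustrative assumption): Python's built-in `callable()` shows that a method is a function waiting to be called, while an attribute is just a value.

```python
import pandas as pd

# Hypothetical toy data
df_toy = pd.DataFrame({"x": [1, 2, 3]})

# Accessing an attribute returns a value directly; no parentheses are needed
print(df_toy.shape)

# The attribute's value is a plain tuple, which cannot be called
print(callable(df_toy.shape))

# Without parentheses, a method is only a reference to a function; nothing is computed yet
print(callable(df_toy.describe))

# Adding parentheses calls the method and returns its result
print(df_toy.describe())
```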


<details class="details-example"><summary style="color:blue"><u>Continue now...?</u></summary>

### Prelecture VS Postlecture HW
    
Feel free to work on the "Postlecture" HW below if you're making good progress and want to continue: in this case this is particularly reasonable as questions "6" and "7" below directly follow up and extend the "Prelecture" HW questions

*The benefits of continuing are that (a) it might be fun to try to tackle the challenge of working through some problems without additional preparation or guidance; and (b) this is a very valuable skill to be comfortable with; and (c) it will let you build experience interacting with ChatBots (and beginning to understand their strengths and limitations in this regard)... it's good to have a sense of when using a ChatBot is the best way to figure something out, or if another approach (such as course provided resources or a plain old websearch for the right resource) would be more effective*
    
</details>    


## "Post-lecture" HW [*submission along with "Pre-lecture" HW is due prior to next TUT*]

#### 6. The `df.describe()` method provides the 'count', 'mean', 'std', 'min', '25%', '50%', '75%', and 'max' summary statistics for each variable it analyzes. Give the definitions (perhaps using help from the ChatBot if needed) of each of these summary statistics<br>

<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> The answers here actually make it obvious why these can only be calculated for numeric variables in a dataset, which should help explain the answer to "4(a)" and "4(b)" above
>   
> Also notice that when `df.describe()` is used missing values are not explicitly removed, but `df.describe()` provides answers anyway. Is it clear what `df.describe()` does with the data in each column it analyzes if there is missing data in the column in question? 
>
> The next question addresses removing rows or columns from a dataset in order to explicitly remove the presence of any missingness in the dataset (assuming we're not going to fill in any missing data values using any missing data imputation methods, which are beyond the scope of STA130); so, the behavior of `df.describe()` hints that explicitly removing missing data may not always be necessary; but the concern is that not all methods may be able to handle missing data the way `df.describe()` does...
    
</details>

Count: This shows you the number of non-null entries in each column of the dataset.

Mean: The average value (sum of the non-missing values divided by the number of non-missing values) of each numeric column.

Std: A measure of the amount of variation or dispersion in the dataset. A high standard deviation indicates that values are spread out over a wider range, while a low standard deviation indicates that values are close to the mean.

Min: This provides you with the data set's smallest value (the lower bound).

25%: The value below which 25% of the data falls. It is a measure of the lower boundary of the data's central range.

50%: The middle value of the dataset when it is ordered from smallest to largest. Half of the values are below this point, and half are above. It is a measure of central tendency that is less sensitive to outliers compared to the mean.

75%: The value below which 75% of the data falls. It marks the upper boundary of the central range of the data.

Max: This provides you with the data set's largest value (the upper bound).
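
As a quick check on these definitions, here is a minimal sketch that recomputes each statistic by hand and compares it with `describe()`. The column `x` and its values are made up for illustration (they are not from the homework dataset), and one value is deliberately missing to show that it is skipped rather than counted.

```python
import pandas as pd

# Hypothetical toy column with one missing value (not from the assignment data)
values = pd.Series([2.0, 4.0, 4.0, 10.0, None], name="x")

summary = values.describe()

# Recompute each statistic directly; pandas skips NaN in all of these
by_hand = pd.Series({
    "count": values.count(),                 # non-missing entries only
    "mean": values.sum() / values.count(),   # sum divided by non-missing count
    "std": values.std(),                     # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
})

# Side-by-side comparison of the two sets of numbers
print(pd.DataFrame({"describe": summary, "by_hand": by_hand}))
```

The two columns should agree row by row, and `count` is 4 rather than 5 because the missing entry is ignored.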

#### 7. Missing data can be considered "across rows" or "down columns".  Consider how `df.dropna()` or `del df['col']` should be applied to most efficiently use the available non-missing data in your dataset and briefly answer the following questions in your own words

1. Provide an example of a "use case" in which using `df.dropna()` might be preferred over using `del df['col']`<br><br>
    
2. Provide an example of "the opposite use case" in which using `del df['col']` might be preferred over using `df.dropna()` <br><br>
    
3. Discuss why applying `del df['col']` before `df.dropna()` when both are used together could be important<br><br>
    
4. Remove all missing data from one of the datasets you're considering using some combination of `del df['col']` and/or `df.dropna()` and give a justification for your approach, including a "before and after" report of the results of your approach for your dataset.<br><br>

<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> Start a new ChatBot session **[but remember to first ask your ChatBot for summaries of your current session and perhaps coding results (so you can supply these in the homework as requested)]**, since your last ChatBot session has likely gotten quite long and has covered a lot of material at this point 
> - It can sometimes be helpful to reset ChatBot sessions to refocus them on the topics of inquiry without too much backlog history that might unintentionally bias things in certain directions and, of course, you can always re-introduce material from earlier conversations as it's relevant, such as for answering "4" based on reintroducing and updating code you made in a previous ChatBot session.  
> 
> #### ChatBot Scope Guidance
> 
> - This question is not interested in the general benefits of imputing missing data, or the general benefits of using `df.dropna()` and/or `del df['col']` to remove missing data, just how to most efficiently remove missing data if a user chooses to do so
> 
> - More sophisticated analyses for "filling in" rather than removing missing data (as considered here) are possible (based on making assumptions about missing data and using specific imputation methods or models) but these are "beyond the scope" of this homework assignment so these topics can be safely ignored for now
> 
> #### ChatBot Code Troubleshooting
> 
> A key issue to be aware of when asking ChatBots for help with something is that they are not running and checking code for correctness, and they often intertwine written instructions with code instructions; so, BEFORE YOU RUN ANY CODE provided by a ChatBot, you should check the following
> 
> 1. If this code changes an object or data, are you sure you want to run this code?
> 2. Can you easily "undo" the results of running code (e.g., from a copy `df_saved=df.copy()` or reloading the data) if running the code doesn't do what you want?
> 3. Is the state of the data what is expected by the code? Or have the objects been updated and changed so they're no longer what the code expects them to be? 
> 
> #### If you get any `Python` errors, copy and paste them into the ChatBot prompt and see if it can help you resolve them; but, keep in mind the final point above because the ChatBot might not be aware of the state of your objects relative to the code it's producing...

</details><br>
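
The "undo" precaution from the troubleshooting checklist can be sketched as follows. This is only an illustration on a made-up two-row dataframe; the point is that `del` changes `df` in place, so a copy taken beforehand is what makes the step reversible.

```python
import pandas as pd

# Hypothetical toy dataframe (not the assignment data)
df = pd.DataFrame({"age": [22.0, None], "deck": [None, "C"]})

df_saved = df.copy()        # independent snapshot taken BEFORE any destructive step

del df["deck"]              # modifies df in place; df_saved is unaffected
print(list(df.columns), list(df_saved.columns))

df = df_saved.copy()        # "undo": restore the snapshot
print(list(df.columns))
```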

7.1 A use case in which using df.dropna() might be preferred over using del df['col'] is if I was only trying to get rid of missing values. For example, if I wanted to only include valid and non-missing deck information, I wouldn't use del df['deck'] (which would get rid of the entire column), I would use df.dropna() which would remove any rows with missing data and leave me with non-missing data for deck info.

In [11]:
df.dropna()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
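
One detail worth seeing on a small example: a bare `dropna()` removes a row if any column is missing, while `dropna(subset=[...])` only looks at the listed columns. The sketch below uses a made-up four-row dataframe (the values are illustrative, not taken from the Titanic file).

```python
import pandas as pd

# Hypothetical toy data with missing values in two different columns
toy = pd.DataFrame({
    "age":  [22.0, None, 35.0, 54.0],
    "deck": ["C", "E", None, None],
    "fare": [7.25, 71.28, 8.05, 51.86],
})

any_missing = toy.dropna()                 # drops a row if ANY column is missing
deck_only = toy.dropna(subset=["deck"])    # drops a row only if 'deck' is missing

print(len(toy), len(any_missing), len(deck_only))
```

If the goal is only "rows with valid deck information", the `subset` form keeps the row whose age is missing, whereas the bare call discards it.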


7.2 If I thought that deck information was no longer useful for the analysis (or also because it has many missing values which affects the size of the dataset) I would use del df['deck'] over df.dropna(). This would remove the entire deck column and allow me to preserve more rows in comparison to df.dropna().

In [12]:
df['deck']

0      NaN
1        C
2      NaN
3        C
4      NaN
      ... 
886    NaN
887      B
888    NaN
889      C
890    NaN
Name: deck, Length: 891, dtype: object

In [14]:
df.shape

(891, 15)

7.3 Applying del df['col'] before df.dropna() when both are used together could be important because it might be more efficient and you can avoid losing more data than you need to. For example, if I used del df['deck'] before using df.dropna() for the Titanic dataset, I would first be removing a column that I do not necessarily need (and one that has a high percentage of missing data), then I would remove all missing values from the columns. By doing so, I make sure df.dropna() only operates on necessary columns. Furthermore, I am able to retain more useful data by doing these steps in this order since if I used df.dropna() first I would be losing a lot of rows simply because of the 688 missing values in the deck column (a column that I don't even want anymore). 

In [24]:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

In [25]:
print(df.columns)

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')


7.4 I would start by applying del df['deck'], since the deck column has so many missing values that df.dropna() would make me lose many rows. After doing so, I would use df.dropna() to remove the rows with missing data in the remaining columns (mostly the age column). By applying these steps in this order, I ensure that I am retaining many more rows of data that could be useful in my data analysis and I ensure that the code runs more efficiently.

In [28]:
# Before: Checking missing values in each column
print("Before cleaning:")
print(df.isnull().sum())

# Step 1: Remove columns with significant missing values (like 'deck')
if 'deck' in df.columns:
    del df['deck']

# Step 2: Drop rows with missing values in remaining columns
df_cleaned = df.dropna()

# After: Checking the shape and remaining missing values
print("\nAfter cleaning:")
print(df_cleaned.isnull().sum())
print("Shape of dataset before cleaning:", df.shape)
print("Shape of dataset after cleaning:", df_cleaned.shape)


Before cleaning:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
embark_town      2
alive            0
alone            0
dtype: int64

After cleaning:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
Shape of dataset before cleaning: (891, 14)
Shape of dataset after cleaning: (712, 14)
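
The effect of the ordering can also be checked on a small made-up example. The sketch below is not the Titanic data; it uses `drop(columns=...)`, which returns a new dataframe instead of changing the original the way `del` does, so both strategies can be compared from the same starting point.

```python
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing, 'age' is missing once
toy = pd.DataFrame({
    "age":  [22.0, 38.0, None, 35.0, 54.0, 2.0],
    "deck": [None, "C", None, None, "E", None],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08],
})

# Strategy A: drop incomplete rows while the sparse column is still present
rows_dropna_only = toy.dropna()

# Strategy B: remove the sparse column first, then drop incomplete rows
rows_column_first = toy.drop(columns=["deck"]).dropna()

print("Original shape:      ", toy.shape)
print("dropna only:         ", rows_dropna_only.shape)
print("drop column, dropna: ", rows_column_first.shape)
```

Here removing the sparse column first keeps five of the six rows, while `dropna()` alone keeps only two.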






    
#### 8. Give brief explanations in your own words for any requested answers to the questions below

> This problem will guide you through exploring how to use a ChatBot to troubleshoot code using the "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv" data set 
> 
> To initially constrain the scope of the responses from your ChatBot, start a new ChatBot session with the following slight variation on the initial prompting approach from "2" above
> - "I am going to do some initial simple summary analyses on the titanic data set I've downloaded (https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv) which has some missing values, and I'd like to get your help understanding the code I'm using and the analysis it's performing"
        
1. Use your ChatBot session to understand what `df.groupby("col1")["col2"].describe()` does and then demonstrate and explain this using a different example from the "titanic" data set other than what the ChatBot automatically provides for you
    
> If needed, you can help guide the ChatBot by showing it the code you've used to download the data **AND provide it with the names of the columns** using either a summary of the data with `df.describe()` or just `df.columns` as demonstrated [here](../CHATLOG/COP/00017_copilot_groupby.md)
    
2. Assuming you've not yet removed missing values in the manner of question "7" above, `df.describe()` would have different values in the `count` value for different data columns depending on the missingness present in the original data.  Why do these capture something fundamentally different from the values in the `count` that result from doing something like `df.groupby("col1")["col2"].describe()`?

> Questions "4" and "6" above address how missing values are handled by `df.describe()` (which is reflected in the `count` output of this method); but, `count` in conjunction with `groupby` has another primary function that's more important than addressing missing values (although missing data could still play a role here).

3. Intentionally introduce the following errors into your code and report your opinion as to whether it's easier to (a) work in a ChatBot session to fix the errors, or (b) use google to search for and fix errors: first share the errors you get in the ChatBot session and see if you can work with ChatBot to troubleshoot and fix the coding errors, and then see if you think a google search for the error provides the necessary troubleshooting help more quickly than ChatGPT<br><br>
    
    1. Forget to include `import pandas as pd` in your code 
       <br> 
       Use Kernel->Restart from the notebook menu to restart the Jupyter notebook session, which unloads imported libraries so you can start over and create this error
       <br><br>
       When python has an error, it sometimes provides a lot of "stack trace" output, but that's not usually very important for troubleshooting. For this problem for example, all you need to share with ChatGPT or search on google is `"NameError: name 'pd' is not defined"`<br><br>

    2. Mistype "titanic.csv" as "titanics.csv"
       <br> 
       If ChatBot troubleshooting is based on downloading the file, just replace the whole url with "titanics.csv" and try to troubleshoot the subsequent `FileNotFoundError: [Errno 2] No such file or directory: 'titanics.csv'` (assuming the file is indeed not present)
       <br><br>
       Explore introducing typos into a couple other parts of the url and note the slightly different errors this produces<br><br>
      
    3. Try to use a dataframe before it's been assigned into the variable
       <br> 
       You can simulate this by just misnaming the variable. For example, if you should write `df.groupby("col1")["col2"].describe()` based on how you loaded the data, then instead write `DF.groupby("col1")["col2"].describe()`
       <br><br>
       Make sure you've fixed your file name so that's not the error any more<br><br>
        
    4. Forget one of the parentheses somewhere in the code
       <br>
       For example, if the code should be `pd.read_csv(url)` then change it to `pd.read_csv(url`<br><br>
        
    5. Mistype one of the names of the chained functions in the code 
       <br>
       For example, try something like `df.group_by("col1")["col2"].describe()` and `df.groupby("col1")["col2"].describle()`<br><br>
        
    6. Use a column name that's not in your data for the `groupby` and column selection 
       <br>
       For example, try capitalizing the column names, such as replacing "sex" with "Sex" in `titanic_df.groupby("sex")["age"].describe()`, and then instead introducing the same error for "age"<br><br>
        
    7. Forget to put the column name as a string in quotes for the `groupby` and column selection, and see if the ChatBot and google are still as helpful as they were for the previous question
       <br>
       For example, something like `titanic_df.groupby(sex)["age"].describe()`, and then `titanic_df.groupby("sex")[age].describe()`
        


In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

In [1]:
import pandas as pd

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

# Group by 'pclass' and get descriptive statistics for 'fare'
result = df.groupby("pclass")["fare"].describe()
print(result)

        count       mean        std  min       25%      50%   75%       max
pclass                                                                     
1       216.0  84.154687  78.380373  0.0  30.92395  60.2875  93.5  512.3292
2       184.0  20.662183  13.417399  0.0  13.00000  14.2500  26.0   73.5000
3       491.0  13.675550  11.778142  0.0   7.75000   8.0500  15.5   69.5500


8.1 df.groupby("col1")["col2"].describe() groups the dataframe by unique values in col1, then selects col2 and generates descriptive statistics (such as mean or std) for col2 within each group defined by col1. In the example above, descriptive statistics for fare are generated within each group defined in pclass (1st, 2nd, and 3rd class).

In [31]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


8.2 The count from df.describe() is the total number of non-null values in each numeric column, while the count from df.groupby("col1")["col2"].describe() is the number of non-null values of col2 within each group defined by col1. Since col1 may have several distinct values, the count differs from group to group, so it mainly reflects how many observations fall into each group rather than only how much data is missing.
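
A small made-up example may make the difference between the two counts concrete. The dataframe below is hypothetical (six rows, two of them missing `age`) and is only meant to show where each number comes from.

```python
import pandas as pd

# Hypothetical toy data: three passenger classes, two missing ages
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "age":    [38.0, None, 27.0, 22.0, None, 4.0],
})

# One count for the whole column: non-missing ages overall
overall_count = toy["age"].describe()["count"]

# One count per group: non-missing ages within each pclass
group_counts = toy.groupby("pclass")["age"].describe()["count"]

# Group sizes include rows where age is missing
group_sizes = toy.groupby("pclass").size()

print(overall_count)
print(group_counts)
print(group_sizes)
```

The per-group counts add up to the overall count, and they fall short of the group sizes exactly where ages are missing.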

Question 8.3 a)

In [1]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

NameError: name 'pd' is not defined

For this error, I would say it was easier and much faster to use ChatGPT since it explained and gave me the solution to copy immediately whereas when I Googled the error, I needed to actually take some time to go through some websites and find a solution.

What ChatGPT told me:
"The NameError: name 'pd' is not defined indicates that Python doesn't recognize pd because pandas hasn't been imported or the import statement is missing. To resolve this error, make sure you have imported pandas with the alias pd before using it."

Question 8.3 b) 

In [4]:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanics.csv"
df = pd.read_csv(url)

HTTPError: HTTP Error 404: Not Found

I would say once again that it's much faster and better to use ChatGPT to find the solution to my problem. Copying and pasting the error line into Google didn't even give me any websites or solutions!

What ChatGPT told me:
"The HTTPError indicates there was an issue with accessing the URL. The URL you've provided seems to have a typo; it should be titanic.csv instead of titanics.csv."

Question 8.3 c)

In [16]:
DF.groupby("col1")["col2"].describe()

NameError: name 'DF' is not defined

This time, it was closer between ChatGPT and Google. However, I would say that ChatGPT was just a bit quicker and gave me a slightly clearer and more concise answer than Google.

What ChatGPT told me:
"It seems like you're encountering a NameError related to the df variable not being defined. This could happen if the DataFrame df hasn't been created or if it was defined with a different variable name."

Question 8.3 d)

In [5]:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanics.csv"
df = pd.read_csv(url

SyntaxError: incomplete input (2450341457.py, line 3)

Using ChatGPT was much easier than Google this time. With Google, none of the websites gave me a solution that related to the error above. However, with ChatGPT, I was given the exact solution right away.

What ChatGPT told me:
"The SyntaxError: incomplete input typically occurs when there is a missing parenthesis or other syntax error in your code. In your case, it looks like there is a missing closing parenthesis in the pd.read_csv(url) line."

Question 8.3 e)

In [5]:
df.group_by("pclass")["age"].describe()

AttributeError: 'DataFrame' object has no attribute 'group_by'

In [7]:
df.groupby("pclass")["age"].describle()

AttributeError: 'SeriesGroupBy' object has no attribute 'describle'

Once again, ChatGPT quickly provided me with solutions, while Google only confused me with solutions that did not really apply to the above error.

What ChatGPT told me:
"The error you're encountering is due to a typo in the method name. The correct method name is groupby (not group_by)."

"It seems you've encountered an AttributeError because of a typo: 'describle' should be 'describe'."

Question 8.3 f)

In [12]:
df.groupby("Sex")["age"].describe()

KeyError: 'Sex'

In [9]:
df.groupby("sex")["Age"].describe()

KeyError: 'Column not found: Age'

Google actually gave me a solution a bit more quickly than ChatGPT ("a typo in the column name"), but ChatGPT told me the exact solutions for the errors. So I would still say ChatGPT did better (just by a bit though).

What ChatGPT told me:
"The KeyError: 'Sex' means that the column "Sex" does not exist in the DataFrame, possibly because the column name is spelled differently or uses different capitalization. In this dataset, the correct column name is likely "sex" (all lowercase)."

"The KeyError: 'Column not found: Age' means that the column name "Age" does not exist in the DataFrame. The issue is likely caused by incorrect capitalization. Column names in pandas are case-sensitive, so "Age" is not the same as "age"."

Question 8.3 g)

In [13]:
df.groupby(sex)["age"].describe()

NameError: name 'sex' is not defined

In [11]:
df.groupby("sex")[age].describe()

NameError: name 'age' is not defined

ChatGPT did better this time as it told me the exact solutions, while Google gave me solutions unrelated to the errors above (there were solutions, but not solutions useful to me).

What ChatGPT told me:
"The NameError: name 'sex' is not defined indicates that you're trying to reference sex without enclosing it in quotes, which causes Python to interpret it as a variable name rather than a column name."

"The NameError: name 'age' is not defined is occurring because age is being referenced without quotes. Like sex, column names in pandas should always be enclosed in quotes when you're accessing them."

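Most of the mistakes in 8.3 can be reproduced in one place to see which exception type each one raises. The sketch below is only an illustration on a made-up dataframe; the helper `error_type` and the toy data are hypothetical names introduced here. The missing-parenthesis case is left out because a `SyntaxError` stops the cell before any code runs.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic file
toy = pd.DataFrame({"sex": ["male", "female", "female"], "age": [22.0, 38.0, None]})

def error_type(action, frame):
    """Run action(frame) and return the name of the exception it raises."""
    try:
        action(frame)
    except Exception as exc:
        return type(exc).__name__
    return "no error"

# DF and sex are intentionally undefined names, as in 8.3 c) and 8.3 g)
observed = {
    "misnamed variable": error_type(lambda f: DF.groupby("sex"), toy),
    "misspelled groupby": error_type(lambda f: f.group_by("sex"), toy),
    "misspelled describe": error_type(lambda f: f.groupby("sex")["age"].describle(), toy),
    "wrong groupby column": error_type(lambda f: f.groupby("Sex"), toy),
    "wrong selected column": error_type(lambda f: f.groupby("sex")["Age"], toy),
    "unquoted column name": error_type(lambda f: f.groupby(sex), toy),
}

for mistake, name in observed.items():
    print(f"{mistake}: {name}")
```

Grouping the mistakes this way shows that they fall into three families: undefined names (`NameError`), misspelled methods (`AttributeError`), and column labels that do not exist (`KeyError`).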
#### 9. Have you reviewed the course [wiki-textbook](https://github.com/pointOfive/stat130chat130/wiki) and interacted with a ChatBot (or, if that wasn't sufficient, real people in the course piazza discussion board or TA office hours) to help you understand all the material in the tutorial and lecture that you didn't quite follow when you first saw it?<br>
    
<details class="details-example"><summary style="color:blue"><u>Further Guidance</u></summary>

> Just answering "Yes" or "No" or "Somewhat" or "Mostly" or whatever here is fine as this question isn't a part of the rubric; but, the midterm and final exams may ask questions that are based on the tutorial and lecture materials; and, your own skills will be limited by your familiarity with these materials (which will determine your ability to actually do actual things effectively with these skills... like the course project...)
    
</details>
    
***Don't forget to ask for summaries of your ChatBot session(s) and paste these into your homework notebook (including link(s) to chat log histories if you're using ChatGPT)!***

Yes!!!

# Recommended Additional Useful Activities [Optional]

The "Ethical Professionalism Considerations" and "Current Course Project Capability Level" sections below **are not a part of the required homework assignment**; rather, they are regular weekly guides covering (a) relevant considerations regarding professional and ethical conduct, and (b) the analysis steps for the STA130 course project that are feasible at the current stage of the course<br><br>

<details class="details-example"><summary style="color:blue"><u>Ethical Professionalism Considerations</u></summary>

## Ethical Professionalism Considerations

> If the observed data is "no events occurred", does this mean the data is "missing" and [should be ignored](https://priceonomics.com/the-space-shuttle-challenger-explosion-and-the-o)?
> 
> - NASA: \<determines temperature doesn't affect "o-ring" by subsetting data to just "o-ring" incidents\>
> - Also NASA: \<launches the shuttle on a cold day\>

|No apparent "o-ring" failure and temperature relationship|Apparent "o-ring" failure and temperature relationship|
|:-|:-|
|if you just look at "o-ring" failure event data|if you instead look at ALL the data as you should|
|![](https://etzq49yfnmd.exactdn.com/wp-content/uploads/2022/03/image06-14.png)|![](https://etzq49yfnmd.exactdn.com/wp-content/uploads/2022/03/image02-33.png)|
|![](https://upload.wikimedia.org/wikipedia/commons/8/8b/Shuttle_Challenger_explosion.gif?20190203170223)|![](https://i.makeagif.com/media/10-04-2014/nT57xW.gif)|

<br>
    
</details>    

<details class="details-example"><summary style="color:blue"><u>Current Course Project Capability Level</u></summary>

## Current Course Project Capability Level

> The data we'll use for the STA130 course project is based on the [Canadian Social Connection Survey](https://casch.org/cscs). Please see the [data use agreement](https://static1.squarespace.com/static/60283c2e174c122f8ebe0f39/t/6239c284d610f76fed5a2e69/1647952517436/Data+Use+Agreement+for+the+Canadian+Social+Connection+Survey.pdf) regarding the appropriate and ethical professional use of this data (available at the bottom of the [CSCS](https://casch.org/cscs) webpage).
> 
> 1. Have a very quick look at the list of available variables using the [link](https://drive.google.com/file/d/1ISVymGn-WR1lcRs4psIym2N3or5onNBi/view) (again at the bottom of the [CSCS](https://casch.org/cscs) webpage); then, 
> 2. examine the code in the first thirteen code cells of [STA130F24_CourseProject.ipynb](https://github.com/pointOfive/stat130chat130/blob/main/CP/STA130F24_CourseProject.ipynb) to get an initial understanding of how we might subset to different studies included in the [data](https://drive.google.com/file/d/1mbUQlMTrNYA7Ly5eImVRBn16Ehy9Lggo/view) (again accessible at the bottom of the [CSCS](https://casch.org/cscs) webpage); then,     
> 3. review the fourteenth and fifteenth cells (with the comments "Here's a high level summary of the data" and "And here are some explanations about the columns in the data") a little more closely to get a better sense of which columns seem to be the most interesting and whether or not they seem to have a lot of missing data
    
</details>        

### Afterward

Here are a few ideas of some other kinds of interactions you might consider exploring with a ChatBot...

> While these are likely to be extremely practically valuable, they are not a part of the homework assignment, so do not include anything related to these in your homework submission

- With respect to improving one's ability in statistics, coding, communication, and other key data science skills
    - what is the ChatBot's perception of its own capabilities and uses as an AI-driven assistance tool?
    - and does the ChatBot's assessment of itself influence or agree with your own evaluation of the ChatBot? 

- ChatBots can introduce and explain the "World War 2 planes" problem and the "Monty Hall" problem... 
    - how well do they seem to do at introducing and explaining other "unintuitive surprising statistics paradoxes"?

- If you consider the process of writing about why you chose to take this course, and the skills you were hoping to build through this course with respect to your current ideas about possible careers
    - how do you think the exercise would be different if you framed it as a dialogue with a ChatBot?
    - and do you think the difference could be positive and productive, or potentially biasing and distracting?
    
- ChatBots sometimes immediately respond in simple helpful ways, but other times they give a lot of extraneous information that can be overwhelming... are you able to prompt and interact with ChatBots in a manner that keeps their responses helpful and focused on what you're interested in? 

- ChatBots tend to respond in a fairly empathetic and supportive tone...
    - do you find it helpful to discuss concerns you might have about succeeding in the course (or entering university more generally) with a ChatBot?
    
- For what purposes and in what contexts do you think a ChatBot could provide suggestions or feedback about your experiences that might be useful? 
