In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)
df.isna().sum()


survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [8]:
import pandas as pd

# Load the dataset (you can replace this with your own dataset)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
titanic_data = pd.read_csv(url)

# Get the number of rows and columns
rows, columns = titanic_data.shape

print(f"The DataFrame has {rows} rows and {columns} columns.")


The DataFrame has 891 rows and 15 columns.


### 2.2 Observations and Variables

**Observations**:
   - Observations are the individual units or records in the dataset, and each row corresponds to one observation. In this dataset, each row represents a single passenger on the Titanic, so one observation is the entire row of values recorded for that passenger.
   
**Variables**:
   - Variables are the characteristics or attributes recorded for each observation. In a DataFrame, variables correspond to the columns. In this dataset, each column records one characteristic of the passengers, such as their age or sex.
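
The row/column distinction can be sketched on a tiny example. The DataFrame below (named toy here) is a hypothetical three-passenger stand-in with made-up values, not the actual Titanic file; it only illustrates that a row is one observation and a column is one variable.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (values invented for illustration)
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
    "fare": [7.25, 71.28, 7.93],
})

# One observation = one row: every value recorded for a single passenger
first_passenger = toy.iloc[0]
print(first_passenger)

# One variable = one column: the same attribute recorded for every passenger
ages = toy["age"]
print(ages)

# The shape attribute lists the number of observations first, then variables
n_observations, n_variables = toy.shape
print(n_observations, n_variables)
```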

In [7]:
import pandas as pd

# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)

# 1. Summarize the dataset with df.describe()
summary = df.describe()
print("Summary Statistics:")
print(summary)

# 2. Count unique values in a specific column (for example, 'sex' or 'class')
value_counts_sex = df['sex'].value_counts()
value_counts_class = df['class'].value_counts()

print("\nValue Counts for 'sex' column:")
print(value_counts_sex)

print("\nValue Counts for 'class' column:")
print(value_counts_class)



Summary Statistics:
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Value Counts for 'sex' column:
sex
male      577
female    314
Name: count, dtype: int64

Value Counts for 'class' column:
class
Third     491
First     216
Second    184
Name: count, dtype: int64


### 4. Summary of Discrepancies

- (a) Number of columns analyzed: df.describe() summarizes only numerical columns by default, so it covers fewer columns than df.shape reports. Here df.shape reports 15 columns, while df.describe() summarizes only 6 (survived, pclass, age, sibsp, parch, and fare).
- (b) "count" values: The "count" in df.describe() is the number of non-null values in each column, which can be less than the total number of rows when data is missing. For example, age has a count of 714 out of 891 rows because 177 ages are missing.
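
Both discrepancies can be reproduced on a small example. The sketch below uses a hypothetical four-row stand-in (toy, with invented values) rather than the real dataset. The include="all" argument at the end is one way to ask describe() for the non-numeric columns as well.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: two numeric columns and one text column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "female"],
})

summary = toy.describe()

# (a) describe() keeps only the numeric columns by default
print(toy.shape[1], "columns in total;", summary.shape[1], "columns summarized")

# (b) the "count" row skips missing values, so it can be below the row count
print(summary.loc["count"])

# include="all" asks describe() to summarize non-numeric columns as well
summary_all = toy.describe(include="all")
print(summary_all.columns.tolist())
```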

### 5. The Difference between an Attribute and a Method

- An attribute is a characteristic of an object and usually holds some data about that object. For example, df.shape does not require parentheses because it behaves like a variable tied to the object: it takes no arguments and simply reports information about the object (the number of rows and columns).
- A method, on the other hand, is a function attached to an object that performs some operation on its data. df.describe() requires parentheses because it is a function call, and it computes statistics from the data such as the mean and median.
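
A short sketch of the difference, using a hypothetical toy DataFrame with invented values. The built-in callable() is used here only to show that a method is a function while shape is plain data.

```python
import pandas as pd

# Hypothetical stand-in data (values invented for illustration)
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.93]})

# Attribute: information about the object, accessed without parentheses
shape = toy.shape
print(shape)

# Method: a function attached to the object, called with parentheses
summary = toy.describe()
print(summary)

# A method is callable; an attribute such as shape is just a value
print(callable(toy.describe))
print(callable(toy.shape))
```

Writing toy.describe without parentheses does not raise an error; it refers to the function itself without running it, which is a common source of confusing output.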

### 6. Summary Statistics Definitions


- Count - Number of non-null values. It shows how many non-missing data points exist in each column
- Mean - Average of the values. It shows the central tendency but can be skewed by outliers
- Std (Standard Deviation) - Measure of the spread of the data, or how much values vary from the mean. It helps show the variability in the dataset.
- Min - Smallest value. Shows the lowest value in the column
- 25% (1st Quartile) - Value below which 25% of the data falls (Q1). Describes the lower part of the distribution
- 50% (Median) - Middle value (half the data is below, half above). Provides a robust measure of central tendency, less affected by outliers
- 75% (3rd Quartile) - Value below which 75% of the data falls (Q3). Describes the upper part of the distribution
- Max - Largest value. Shows the highest value in the column
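
As a check on these definitions, each statistic can be computed individually and compared with the output of describe(). The sketch below uses a hypothetical ages Series with invented values. Note that the std reported by pandas is the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value (invented for illustration)
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan], name="age")

summary = ages.describe()

# The same eight statistics, computed one at a time
by_hand = pd.Series({
    "count": ages.count(),        # non-null values only
    "mean": ages.mean(),          # average of the non-null values
    "std": ages.std(),            # sample standard deviation (ddof=1)
    "min": ages.min(),
    "25%": ages.quantile(0.25),   # first quartile
    "50%": ages.median(),         # median
    "75%": ages.quantile(0.75),   # third quartile
    "max": ages.max(),
})

# Side-by-side comparison of describe() and the manual calculations
print(pd.DataFrame({"describe": summary, "by_hand": by_hand}))
```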

### 7. Missing Data

- 1. df.dropna() may be preferred over del df['col'] when you want to exclude only the rows with missing values in a column while preserving the column itself and the rest of the DataFrame. For example, df.dropna(subset=['age']) removes the rows with a missing age.
- 2. del df['col'] is useful for deleting an entire column that has too many missing values to be useful. In this dataset, a column like deck, which is missing 688 of its 891 values, could be removed with del df['deck'].
- 3. Deleting the columns that are unnecessary due to a high number of missing values before calling df.dropna() is important because df.dropna() then only considers the columns that matter. In the opposite order, df.dropna() would also remove every row that is missing a value in a column that is later deleted anyway, discarding rows that could have been kept.
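
The effect of the ordering in point 3 can be sketched on a small example. The helper make_toy and its values are hypothetical and stand in for the real dataset: deck is mostly missing and age has a single gap.

```python
import numpy as np
import pandas as pd

def make_toy():
    # Hypothetical stand-in: "deck" is mostly missing, "age" has one gap
    return pd.DataFrame({
        "age": [22.0, np.nan, 26.0, 35.0],
        "deck": [np.nan, "C", np.nan, np.nan],
        "fare": [7.25, 71.28, 7.93, 53.10],
    })

# Order 1: remove the sparse column first, then drop incomplete rows
df_a = make_toy()
del df_a["deck"]
df_a = df_a.dropna()

# Order 2: drop incomplete rows first, then remove the column
df_b = make_toy()
df_b = df_b.dropna()
del df_b["deck"]

print("column removed first:", df_a.shape)
print("rows dropped first:  ", df_b.shape)
```

In this toy case the second ordering loses every row, because each row is missing either age or deck, even though deck is discarded afterwards.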


In [10]:
### 7.4 Before

import pandas as pd

# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)

# Initial shape of the DataFrame
print("Initial shape of the DataFrame:", df.shape)
print("\nMissing values in each column:\n", df.isna().sum())


Initial shape of the DataFrame: (891, 15)

Missing values in each column:
 survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


Before Cleaning:

Shape: (891, 15) (Initial number of rows and columns)
Missing values: Some columns, such as age and deck, have missing values. They are important, however, so they must be kept.

In [6]:
import pandas as pd

# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)

# Initial shape of the DataFrame
print("Initial shape of the DataFrame:", df.shape)

# Drop rows where 'age' has missing values
df_cleaned_age = df.dropna(subset=['age'])

# Drop rows where 'deck' has missing values
df_cleaned_deck = df.dropna(subset=['deck'])

# Shape after cleaning
print("\nShape of the DataFrame after dropping rows with missing 'age':", df_cleaned_age.shape)
print("\nMissing values in each column after cleaning:\n", df_cleaned_age.isna().sum())
print("\nShape of the DataFrame after dropping rows with missing 'deck':", df_cleaned_deck.shape)
print("\nMissing values in each column after cleaning:\n", df_cleaned_deck.isna().sum())

Initial shape of the DataFrame: (891, 15)

Shape of the DataFrame after dropping rows with missing 'age': (714, 15)

Missing values in each column after cleaning:
 survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           530
embark_town      2
alive            0
alone            0
dtype: int64

Shape of the DataFrame after dropping rows with missing 'deck': (203, 15)

Missing values in each column after cleaning:
 survived        0
pclass          0
sex             0
age            19
sibsp           0
parch           0
fare            0
embarked        2
class           0
who             0
adult_male      0
deck            0
embark_town     2
alive           0
alone           0
dtype: int64


### 7.4 Justification

I used df.dropna() with the subset argument to drop the rows with missing values in the 'age' and 'deck' columns, producing a separate cleaned DataFrame for each column. I didn't delete the columns entirely, since they are important to the dataset, which is why I didn't use del df['col']. Dropping the rows with missing values cleans the dataset and makes it easier to analyse.

### 8. Analysis

1. 
The code df.groupby("col1")["col2"].describe() performs group-wise summary statistics on a specific column of the DataFrame, grouped by another column.

- df.groupby("col1"): Groups the DataFrame by the unique values in "col1". This creates a group for each unique value in the "col1" column.

- ["col2"]: Selects the "col2" column for aggregation. After grouping by "col1", this specifies that you want to perform operations on the "col2" column within each group.

- .describe(): Generates descriptive statistics for the "col2" column within each group. This includes count, mean, standard deviation, minimum, maximum, and percentiles.
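
The three steps above can be sketched on a small example. The toy DataFrame and its values are hypothetical. The last lines filter one group by hand to show that each row of the grouped result matches describe() applied to that group alone.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (values invented for illustration)
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "age": [38.0, 35.0, 22.0, np.nan, 26.0],
})

# One row of summary statistics per unique value of "pclass"
grouped = toy.groupby("pclass")["age"].describe()
print(grouped)

# The same statistics for a single group, obtained by filtering manually
first_class = toy.loc[toy["pclass"] == 1, "age"].describe()
print(first_class)
```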



In [14]:
import pandas as pd

# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)

# Display the first few rows to understand the structure
print("First few rows of the dataset:")
print(df.head())

# Group by 'pclass' and describe 'age'
grouped_description = df.groupby("pclass")["age"].describe()

# Display the result
print("\nDescriptive statistics for 'age' grouped by 'pclass':")
print(grouped_description)


First few rows of the dataset:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

Descriptive statistics for 'age' grouped by 'pclass':
        count       mean        std   min   25%   50%   75%   max
pclass                                                           
1

8.2 
- The count in df.describe() is reported for each numerical column across the entire DataFrame, which gives an overview of how complete each column is. If a column has any missing values, its count will be less than the total number of rows.
- The count in df.groupby("col1")["col2"].describe() is the number of non-missing col2 values within each group defined by col1. The count can differ from group to group, depending on the size of each group and on how many of its values are missing, so it can highlight how missing values are distributed across the categories in col1.
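
A minimal sketch of the two counts, on a hypothetical toy DataFrame with invented values. It assumes the grouping column itself has no missing values, in which case the per-group counts add up to the overall count.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: all missing ages fall in pclass 3
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [38.0, 35.0, 54.0, 22.0, np.nan, np.nan],
})

# Overall count of non-missing ages across the whole DataFrame
overall_count = toy.describe().loc["count", "age"]

# Count of non-missing ages within each pclass group
group_counts = toy.groupby("pclass")["age"].describe()["count"]

# Number of missing ages within each pclass group
missing_per_group = toy["age"].isna().groupby(toy["pclass"]).sum()

print("overall:", overall_count)
print(group_counts)
print(missing_per_group)
```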

8.3 
I think that troubleshooting errors with ChatGPT is easier than searching Google for how to fix my code. Although Stack Overflow has good solutions to common Python errors, it is still much easier to ask ChatGPT to fix the code instead. ChatGPT sees the error in the context of the dataset that I'm using and gives me new and improved code. It also explains in detail why the error occurred, and I can keep asking it questions if I still don't understand why it happened. It helps a lot with catching errors in a simple and efficient way.


### Summary of Chat with ChatGPT

In this conversation, you explored various aspects of analyzing the Titanic dataset using Pandas in Python. Here’s a summary of the key points:

**Initial Dataset Overview**:
   - We started by discussing how to load and explore the dataset, including basic methods like df.describe(), which gives summary statistics for numerical columns, and df.value_counts() to understand the distribution of categorical variables.

**Handling Missing Values**:
   - You encountered and resolved issues related to missing values, including how to use df.dropna() to remove rows with missing values in specific columns.
   - You also discussed the implications of using del df['col'] to drop entire columns with significant missing data and how that differs from using df.dropna().

**Differences Between Attributes and Methods**:
   - We covered the distinction between attributes like df.shape (which do not require parentheses) and methods like df.describe() (which do).

**Descriptive Statistics**:
   - You explored how to use df.groupby("col1")["col2"].describe() to get group-wise descriptive statistics, which is useful when analyzing the relationship between two variables.

**Missing Values and Grouping**:
   - We discussed how missing values affect the output of df.describe() versus df.groupby("col1")["col2"].describe(), particularly how the count differs in each case.

***Link to chat log history*** https://chatgpt.com/share/66e38a15-0e10-8007-8a8c-796222b5ef03