# Lecture 1 - Introduction

### DATA 2201, Fall 2025

## Resources 

* Course Website [Canvas](https://mtu.instructure.com/courses/1571598)
    * Where all lectures, assignments, and readings are posted. 
* JupyterHub 
    * Where you will work on lectures and assignments (links on Canvas will take you here)
* Discussion [Ed](https://edstem.org/us/courses/81046/)
    * Where you should ask and answer questions on assignments and concepts
    * Where announcements will be posted (assignment clarifications, course logistics, etc.)
* Submit Assignments [Gradescope](https://www.gradescope.com/courses/1073735)
    * Where all assignments are submitted
    * Where you can view the grading / feedback on your submissions. 
* Textbook and Notes [Learning Data Science (LDS)](https://learningds.org/intro.html)
    * Links to topics and notes for lectures (available on Canvas)
    * Supplemental reading


## Software Packages 

We will use a wide range of Python software packages. These packages are already installed on your JupyterHub. If you want to replicate this environment on your local machine, you can use the Conda environment manager. The following packages appear routinely in lectures and homework:

In [None]:
# Linear algebra, probability
import numpy as np

# Data manipulation
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Interactive visualization library
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px

Note, we will **not** be using the `datascience` library that was the focus of DATA 1202. We will transition to industry-standard tools such as `pandas`.

# The Data Science Lifecycle 

Note, there are many versions of the data science lifecycle (with 4, 5, or 6 stages), but all express similar ideas.  

For this class, we will use the following depiction of the lifecycle. 

<center><img src="images/ds-lifecycle.svg" width="60%"></center>

**All steps lead to more questions!** We'll refer back to the data science lifecycle repeatedly throughout the course.

## 1. Question/Problem Formulation
<img src="images/ask.png" width="300px" />

- What do we want to know?
- What problems are we trying to solve?
- What hypotheses do we want to test?
- What are our metrics for success?

## 2. Data Acquisition and Cleaning
<img src="images/data_acquisition.PNG" width="300px" />

- What data do we have and what data do we need?
- How will we sample more data?
- Is our data representative of the population we want to study?

## 3. Exploratory Data Analysis & Visualization
<img src="images/understand_data.PNG" width="300px" />

- How is our data organized and what does it contain?
- Do we already have relevant data?
- What are the biases, anomalies, or other issues with the data?
- How do we transform the data to enable effective analysis?

## 4. Prediction and Inference
<img src="images/understand_world.PNG" width="300px" />

- What does the data say about the world?
- Does it answer our questions or accurately solve the problem?
- How robust are our conclusions and can we trust the predictions? 

## Data Science Lifecycle Demo: Are baby names starting with “L” more popular?


## 1. Starting with a Question 

<img src="images/ask.png" width="300px" />

### Lilith, Lilibet … Lucifer? How Baby Names Went to 'L'

[This New York Times](https://www.nytimes.com/2021/06/12/style/lilibet-popular-baby-names.html?amp=&smid=fb-nytimes) article claims that baby names beginning with "L" have become more popular over time.

Let's see if these claims are true, based on the data!


## 2. Data Acquisition and Cleaning 

**We will study various methods to collect data.**

<img src="images/data_acquisition.PNG" width="300px" />

To answer our question of whether "L" baby names have become more popular, let's get some data (source: [SSA.gov website](https://www.ssa.gov/OACT/babynames/index.html)).

In [None]:
# pd is the alias for pandas, which we will start learning next lecture
# Some pandas syntax shared with data8's datascience package
baby = pd.read_csv('../data/baby.csv')

## 3. Exploratory Data Analysis

**We will study exploratory data analysis and practice analyzing new datasets.**

<img src="images/understand_data.PNG" width="300px" />

I didn't tell you the details of the data! Let's check out the data and infer its structure. Then we can start answering the simple questions we posed.

### Peeking at the Data

In [None]:
# Let's peek at the first 20 rows of the baby dataframe


In [None]:
# Let's peek at the first 5 rows (default) of the baby dataframe
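
One way to fill in the two peeking cells is `DataFrame.head`. The sketch below runs on a tiny made-up stand-in for `baby`; the `Sex` and `Count` column names and all values are assumptions for illustration, not the real SSA data.

```python
import pandas as pd

# Hypothetical stand-in for the real `baby` DataFrame; all values are made up.
baby = pd.DataFrame({
    "Name": ["Mary", "Anna", "John", "Liam", "Lucy", "Noah"],
    "Sex": ["F", "F", "M", "M", "F", "M"],
    "Count": [10, 8, 12, 20, 7, 18],
    "Year": [1880, 1880, 1880, 2020, 2020, 2020],
})

# head(n) returns at most the first n rows; fewer if the frame is shorter
first_20 = baby.head(20)

# With no argument, head() returns the first 5 rows
first_5 = baby.head()
```

In a notebook cell, writing `baby.head()` as the last line displays the result as a table.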


### Exploratory Data Analysis on `baby` dataset 



#### How many total babies are counted?
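
A minimal sketch of one way to answer this, assuming the data has a numeric column named `Count` holding the number of babies per row (the column name and the toy values below are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the real `baby` DataFrame; all values are made up.
baby = pd.DataFrame({
    "Name": ["Mary", "Anna", "John", "Liam"],
    "Count": [10, 8, 12, 20],
    "Year": [1880, 1880, 1880, 2020],
})

# Select the Count column (a Series) and add up all of its values
total_babies = baby["Count"].sum()
```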

<div class="alert alert-block alert-warning"> Don't worry about the syntax of the code today, we will dive into details starting on Wednesday.  </div>

####  How many unique names are there by year?
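
A first attempt might count how many rows each year has with `value_counts`. Whether the lecture used this exact call is an assumption; the sketch runs on made-up toy data.

```python
import pandas as pd

# Hypothetical stand-in for the real `baby` DataFrame; all values are made up.
baby = pd.DataFrame({
    "Name": ["Mary", "Anna", "John", "Liam", "Noah"],
    "Count": [10, 8, 12, 20, 18],
    "Year": [1880, 1880, 1880, 2020, 2020],
})

# Number of rows recorded for each year, as a Series indexed by year
rows_per_year = baby["Year"].value_counts()
```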

##### *What does this tell us?* 

There are many more unique names reported in recent years than in the 1880s. 

Why **doesn't** the above Series actually contain the number of unique names per year?
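
One way to probe this question is to compare rows per year with distinct names per year. The sketch assumes the data has one row per name, sex, and year, so a name given to both girls and boys appears twice in the same year; the `Sex` and `Count` columns and the toy values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the real `baby` DataFrame; all values are made up.
# "Taylor" appears twice in 2020, once per sex.
baby = pd.DataFrame({
    "Name": ["Mary", "John", "Taylor", "Taylor", "Liam"],
    "Sex": ["F", "M", "F", "M", "M"],
    "Count": [10, 12, 5, 4, 20],
    "Year": [1880, 1880, 2020, 2020, 2020],
})

# Rows per year: counts "Taylor" twice in 2020
rows_per_year = baby["Year"].value_counts().sort_index()

# Distinct names per year: counts "Taylor" once in 2020
unique_names_per_year = baby.groupby("Year")["Name"].nunique()
```

On the toy data, the two Series agree for 1880 but differ for 2020.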

Exercise: filter the `baby` DataFrame to only include rows where `Year` is 1880, then count the occurrences of each unique `Name` using `value_counts('Name')`.
<details>
<summary>Click to show solution</summary>

<pre>
baby[(baby['Year'] == 1880)].value_counts('Name')
</pre>
</details>


Exercise: select the rows of the `baby` DataFrame where `Year` is 1880 and `Name` is `'Grace'`.
<details>
<summary>Click to show solution</summary>

<pre>
baby[(baby['Year'] == 1880) & (baby['Name'] == 'Grace')]
</pre>
</details>


#### How many babies were recorded per year? 

In [None]:
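# numeric_only=True sums only numeric columns, so text columns such as Name are skipped
baby.groupby('Year').sum(numeric_only=True).plot()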

#### What is the distribution of the length of baby names? 

In [None]:
name_lengths = baby['Name'].str.len()
shortest, longest = name_lengths.min(), name_lengths.max()

# One bin per integer length; the + 2 gives the longest length its own bin
plt.hist(name_lengths, bins=range(shortest, longest + 2), edgecolor='black')
plt.xlabel('Name Length')
plt.ylabel('Frequency')
plt.title('Distribution of Length of Names')
average_length = name_lengths.mean()
plt.axvline(average_length, color='red', linestyle='dashed', linewidth=1, label=f'Average: {average_length:.2f}')
plt.legend()
plt.xticks(range(shortest, longest + 1))
plt.show()

## 4. Analysis: Understanding the World

<img src="images/understand_world.PNG" width="300px" />

#### How many babies have names that start with "L"?

In [None]:
(baby
 .assign(first_letter=baby['Name'].str[0])
 .query('first_letter == "L"')
 .groupby('Year')
 .sum(numeric_only=True)
 .plot(title='Number of Babies Born with an "L" Name Per Year', ylabel="Count")
)
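
Raw counts can rise simply because more babies are recorded each year, so a share may be a fairer measure of popularity. The sketch below computes the fraction of babies per year whose name starts with "L". It assumes a numeric `Count` column, and the toy values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the real `baby` DataFrame; all values are made up.
baby = pd.DataFrame({
    "Name": ["Mary", "Anna", "Lucy", "Liam", "Noah", "Luna"],
    "Count": [10, 8, 2, 20, 15, 5],
    "Year": [1880, 1880, 1880, 2020, 2020, 2020],
})

# Total babies recorded in each year
total_per_year = baby.groupby("Year")["Count"].sum()

# Babies in each year whose name starts with "L"
starts_with_l = baby["Name"].str.startswith("L")
l_per_year = baby[starts_with_l].groupby("Year")["Count"].sum()

# Share of "L" babies per year; years with no "L" names become 0 instead of NaN
l_share = (l_per_year / total_per_year).fillna(0)
```

Calling `l_share.plot()` would then show the trend as a proportion rather than a raw count.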

### We will often use visualizations to make sense of data
**We will deal with many different kinds of data (not just numbers) and we will study techniques to describe types of data.**

**Can we look at popularity of individual names?** 

We can also see individual names changing in popularity over time, as covered by [NBC News](https://www.nbcnews.com/data-graphics/baby-names-2023-most-popular-male-female-rcna151476).

In [None]:
(baby
 .query('Name == "Siri"')
 .groupby('Year')
 .sum(numeric_only=True)
 .plot(title='Number of Babies Born Named "Siri" Per Year', ylabel="Count")
)

In [None]:
def name_graph(name):
    return (baby
     .query(f'Name == "{name}"')
     .groupby('Year')
     .sum(numeric_only=True)
     .plot(title=f'Number of Babies Born Named "{name}" Per Year', ylabel="Count")
    )

In [None]:
name_graph('Arya')

In [None]:
name_graph('Taylor')

## 1. New Questions 

As you become more familiar with your data, new questions often arise, e.g., 

* What is the most popular name in the last decade?
* What is the most popular name of all time?
* What was the name with the largest drop in popularity?
* What was the name with the largest increase in popularity?
* Which names are most often given to both girls and boys?
* ... 
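
As a rough sketch of how two of these questions might be approached, the code below finds the most popular name of all time and the names given to both sexes. It assumes `Sex` and `Count` columns, and the toy values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the real `baby` DataFrame; all values are made up.
baby = pd.DataFrame({
    "Name": ["Mary", "John", "Mary", "Liam", "Taylor", "Taylor"],
    "Sex": ["F", "M", "F", "M", "F", "M"],
    "Count": [10, 12, 3, 20, 5, 4],
    "Year": [1880, 1880, 2020, 2020, 2020, 2020],
})

# Most popular name of all time: total Count per name, then the name with the largest total
total_per_name = baby.groupby("Name")["Count"].sum()
most_popular = total_per_name.idxmax()

# Names given to both sexes: names that appear with more than one distinct Sex value
sexes_per_name = baby.groupby("Name")["Sex"].nunique()
shared_names = sorted(sexes_per_name[sexes_per_name > 1].index)
```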