# Week 1 Overview 
As a data scientist, you will often be handed a dataset that you know almost nothing about. The features could have cryptic names like “MetricCode” or “CategoryLabel,” and you will need to figure out what each feature represents. You might also find something wrong with the data values. For example, the data could show some people having negative ages (minus 10 years old) or show children being older than their parents. There might be missing values or duplicated rows.

This week, you will work with a dataset about bank customers. You will analyze and summarize the data to identify any problems, and then you will fix them. You will also learn how to use Google, documentation, and ChatGPT to develop programming skills.

## Learning Objectives
At the end of this week, you should be able to:

- Describe the importance of using data (data storytelling) in making information accessible to a wider audience
- Extract important data/dataset from an unconventional format (pivot table, etc.)
- Apply descriptive statistics to real-world data/datasets
- Identify data quality issues of a given dataset 
- Explain how data quality issues impact the effective use/display of the data
- Describe effective (good) and ineffective (bad) graphs
- Create data visualizations using Matplotlib
- Reproduce a graph/data visualization from an example graph/data visualization
- Analyze existing data visualizations to identify strengths and weaknesses 
- Apply principles from Storytelling with Data (SWD) to create graphs

### Key Terms
- **Inconsistent data:** data that don’t make sense in some way, such as a person who is supposed to be twenty feet tall or one thousand years old
- **Duplicate rows:** two or more rows that contain exactly the same data
- **Imputing a value:** inserting a value into a dataset where one is missing

## 1.1 Lesson: How to Locate Information about Python Programming

### Preprocessing
To run the notebook cells included in this lesson, first import the following:

In [2]:
import pandas as pd
import numpy as np

from datetime import datetime, timedelta

Let's consider a few possible ways to learn about Python programming. Let's suppose you want to learn how to produce a short summary of the information in your DataFrame.

### 1. Your instructor could provide the information.

You could be provided with a lesson about functions like info() and describe(). If you have a pandas DataFrame called df, then you can summarize its contents using df.info() or df.describe(). df.info() provides a list of column names with their counts and data types. df.describe() will provide information such as the mean, min, max, standard deviation, and quantiles. Thus:

In [3]:
df = pd.DataFrame([[1, 4], [2, 5], [3, 6], [4, 7]], columns = ['A', 'B'])
df.describe()

| | A | B |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 2.5 | 5.5 |
| std | 1.290994 | 1.290994 |
| min | 1.0 | 4.0 |
| 25% | 1.75 | 4.75 |
| 50% | 2.5 | 5.5 |
| 75% | 3.25 | 6.25 |
| max | 4.0 | 7.0 |


In this describe() result, we see that columns A and B each have four elements (the count row). The mean, standard deviation, and other statistics are also shown.
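
The lesson also mentions df.info(), which is not demonstrated above. Here is a minimal sketch on the same small DataFrame. The variable names n_rows and n_cols are chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 4], [2, 5], [3, 6], [4, 7]], columns=['A', 'B'])

# info() prints column names, non-null counts, and data types;
# it prints its report rather than returning a DataFrame
df.info()

# The same basic facts are also available as attributes
n_rows, n_cols = df.shape   # number of rows and columns
print(n_rows, n_cols)
print(df.dtypes)            # data type of each column
```

Because info() reports non-null counts per column, it is a quick first check for missing values, while describe() focuses on the numeric statistics.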

### 2. You could look up the information on Google.

If I Google the question, "How do I briefly summarize the contents of a dataframe using Python?" I receive the following link (among others), which discusses the describe() command mentioned above:

https://www.w3schools.com/python/pandas/ref_df_describe.asp

It also provides the complete usage information:

dataframe.describe(percentiles, include, exclude, datetime_is_numeric)

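It explains that percentiles defaults to [0.25, 0.5, 0.75], but we can change that. Note that the datetime_is_numeric argument listed there was removed in pandas 2.0, so the signature shown on that page may not match a newer installation.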

Let's try it! 

Our four data points are separated by three intervals rather than four, so it might be more meaningful to ask for the 33rd and 67th percentiles than for the 25th, 50th, and 75th. Passing 1/3 and 2/3 instead of the rounded values 0.33 and 0.67 gives the exact percentile values.

In [4]:
df = pd.DataFrame([[1, 4], [2, 5], [3, 6], [4, 7]], columns = ['A', 'B'])
df.describe(percentiles = [1/3, 2/3])

| | A | B |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 2.5 | 5.5 |
| std | 1.290994 | 1.290994 |
| min | 1.0 | 4.0 |
| 33.3% | 2.0 | 5.0 |
| 50% | 2.5 | 5.5 |
| 66.7% | 3.0 | 6.0 |
| max | 4.0 | 7.0 |


Apparently, the 50% value (the median) stays even though we did not specifically request it.
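
The usage line quoted from W3Schools also lists an include argument, which matters once a DataFrame has non-numeric columns. The sketch below uses a hypothetical DataFrame; the column names and values are invented for illustration and are not the course dataset.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column
customers = pd.DataFrame({
    'age': [34, 51, 28, 45],
    'segment': ['retail', 'business', 'retail', 'retail'],
})

# By default, describe() summarizes only the numeric columns
numeric_summary = customers.describe()
print(numeric_summary)

# include='all' also summarizes non-numeric columns, reporting
# count, unique, top (most frequent value), and freq for them
full_summary = customers.describe(include='all')
print(full_summary)
```

In the second result, statistics that do not apply to a column (for example, the mean of a text column) appear as NaN.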

### 3. You could look up the official documentation.

Now that we know we want the pandas describe() function, try Googling: pandas documentation describe.

Here is the general documentation page for pandas:

https://pandas.pydata.org/docs/index.html

Here is the specific page for the describe() function:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

When we look at this page, it appears to document the most recent version of pandas; the version is shown in the upper right corner.
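
Since the documentation page describes a specific pandas version, it can help to check which version is installed locally before relying on it. This is a small sketch; the variable names are illustrative.

```python
import pandas as pd

# Version string of the pandas installed in the current environment
installed_version = pd.__version__
print(installed_version)

# The major version is the number before the first dot
major_version = int(installed_version.split('.')[0])
print(major_version)
```

If the installed version differs from the one shown on the documentation page, some arguments described there may be missing or behave differently.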

### 4. You could also ask ChatGPT.

Let's try it. Using ChatGPT, enter the prompt: "How do I briefly summarize the contents of a dataframe using Python?"

When we do this, ChatGPT mentions describe() among other options but does not go into detail. However, we can ask a follow-up question: "Tell me more about describe() in Python for summarizing dataframes."

Then, we get a good explanation of describe(), although it does not mention the percentiles option. One advantage of using Google or the documentation in addition to ChatGPT is that these sources may provide interesting information that does not directly answer our question; we might not have learned about the various arguments, such as percentiles, if we had used only ChatGPT. A second issue is that ChatGPT sometimes hallucinates, meaning that it makes up information. Most importantly, examining multiple sources such as Google, the documentation, and ChatGPT gives us more information than any single source does.
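
One way to cross-check an answer from any of these sources is to ask the installed library itself. The sketch below uses Python's standard inspect module to list the arguments that describe() accepts in the current environment; the exact list depends on the installed pandas version.

```python
import inspect
import pandas as pd

# Parameters that describe() accepts in the installed pandas version
signature = inspect.signature(pd.DataFrame.describe)
parameter_names = [name for name in signature.parameters if name != 'self']
print(parameter_names)

# help() prints the full built-in documentation for the same function
help(pd.DataFrame.describe)
```

If a source mentions an argument that does not appear in this list, that is a signal to check the official documentation for the installed version.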

You will have an opportunity to use the ideas in this lesson in this week's Homework activity.

### Think About It
- What are some advantages and disadvantages of searching via Google to find out how to write code?
- What are some advantages and disadvantages of going straight to the documentation?
- What are some advantages and disadvantages of using ChatGPT to find out how to write code?

### Reading | Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. John Wiley & Sons.

- Introduction
    - Bad graphs are everywhere
        - We aren't naturally good at storytelling with data
            - Being able to visualize data and tell stories with it is key to turning it into information that can be used to drive better decision making.
            - There is a story in your data; it takes you, the communicator, to bring that story visually and contextually to life. The lessons will enable you to shift from simply showing data to storytelling with data.
        - Who this book is written for: this book is written for anyone who needs to communicate something to someone using data.
        - How you'll learn to tell stories with data - 6 lessons
            - Understand the context
            - Choose an appropriate visual display
            - Eliminate clutter
            - Focus attention where you want it
            - Think like a designer
            - Tell a story

### Reading | Awasthi, A., Krpalkova, L., & Walsh, J. (2024). Deep Learning-Based Boolean, Time Series, Error Detection, and Predictive Analysis in Container Crane Operations. Algorithms, 17(8), 333. 

In this article, you will learn about handling inconsistent data - that is, data that does not make sense.

#### Inconsistent Data
Inconsistent data is data that conflicts or is incompatible within a dataset or across multiple datasets. Data inconsistencies can occur for a variety of reasons, including mistakes in data entry, data processing, or data integration. These discrepancies might show up as disagreements in data element values, formats, or interpretations. Inconsistent data can lead to faulty analysis, untrustworthy outcomes, and data management challenges.
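
As a minimal sketch of how such discrepancies can be flagged in pandas, the example below checks two rules based on the examples from the Week 1 overview (negative ages and children older than their parents). The DataFrame, its column names, and its values are hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical records; names and values are invented for illustration
people = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cleo', 'Dev'],
    'age': [34, -10, 12, 8],
    'parent_age': [60, 41, 10, 39],
})

# Rule 1: an age cannot be negative
negative_age = people['age'] < 0

# Rule 2: a person cannot be as old as, or older than, their parent
not_younger_than_parent = people['age'] >= people['parent_age']

# Rows that break at least one rule
inconsistent = people[negative_age | not_younger_than_parent]
print(inconsistent)
```

Writing each rule as a separate Boolean mask makes it easy to count how many rows break each rule before deciding how to fix them.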

1. Identifying Missing Values
2. Handling Missing Values
    - Imputation: Imputation is the process of filling in missing values. Common imputation methods include mean, median, and mode imputation, as well as more advanced methods like k-Nearest Neighbors (KNN) imputation.
3. Detecting and Handling Outliers
    - Outlier Detection: Outliers are extreme values that deviate significantly from the majority of data points. Common methods include the IQR method and the Z-score method.
    - Handling Outliers: Outliers can be addressed by removing them, transforming the data, or using robust statistical methods that are less sensitive to outliers.
4. Standardizing Data Formats
    - Data format consistency is essential, especially for date, time, and categorical variables. The R functions as.Date() and as.factor() standardize formats; in pandas, pd.to_datetime() and astype('category') play a similar role. Date variables should adhere to a consistent format to ensure accurate analysis and visualization.
5. Dealing with Duplicate Data
    - Duplicate rows can distort analysis results. Ensure that you understand the criteria for identifying duplicates, as they may depend on specific columns.
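
The five steps above can be sketched in pandas as follows. This is one possible approach under stated assumptions, not the method used in the article: the data is hypothetical, median imputation and the 1.5 × IQR rule are common conventions chosen here for illustration, and customer_id is assumed to be the column that defines a duplicate.

```python
import pandas as pd

# Hypothetical customer data; all names and values are invented for illustration
raw = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 4, 5, 6],
    'signup_date': ['2024-01-05', '2024-02-10', '2024-02-10', '2024-03-15',
                    '2024-04-20', '2024-05-25', '2024-06-30'],
    'segment': ['retail', 'Retail ', 'Retail ', 'business',
                'retail', 'BUSINESS', 'retail'],
    'balance': [1200.0, 950.0, 950.0, None, 1100.0, 1050.0, 250000.0],
})

# 1. Identify missing values: count them per column
missing_counts = raw.isna().sum()
print(missing_counts)

# 2. Handle missing values: impute with the median (an assumed choice)
clean = raw.copy()
clean['balance'] = clean['balance'].fillna(clean['balance'].median())

# 3. Detect outliers with the IQR method (1.5 * IQR is a common convention)
q1, q3 = clean['balance'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean['balance'] < q1 - 1.5 * iqr) | (clean['balance'] > q3 + 1.5 * iqr)
print(clean[is_outlier])

# 4. Standardize formats: real datetimes and consistent category labels
clean['signup_date'] = pd.to_datetime(clean['signup_date'])
clean['segment'] = clean['segment'].str.strip().str.lower().astype('category')

# 5. Remove duplicates, judged here by customer_id only (an assumption)
deduplicated = clean.drop_duplicates(subset=['customer_id'], keep='first')
print(deduplicated)
```

Whether a flagged outlier should be removed, transformed, or kept depends on the dataset; the sketch only identifies it.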