# **ABC project**

Your ABC data team is still in the early stages of their latest project. 

The management team is asking for a Python notebook showing data structuring and cleaning, as well as any matplotlib/seaborn visualizations plotted to help understand the data. At the very least, include a box plot of the ride durations and some time series plots, like a breakdown by quarter or month. 

Additionally, the management team has recently asked that all EDA include Power BI visualizations. For this taxi data, create a Power BI dashboard showing a New York City map of taxi/limo trips by month. Make sure it is easy to understand for someone who isn’t data savvy, and remember that the assistant director at the New York City TLC is a person with visual impairments.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# Course 2: Exploratory data analysis

In this activity, you will examine the data provided and prepare it for analysis. You will also design a professional data visualization that tells a story and helps inform data-driven decisions for business needs.

Please note that the Power BI visualization activity is optional and will not affect your completion of the course. Completing the Power BI activity will help you practice planning out and plotting a data visualization based on a specific business need. The structure of this activity is designed to emulate the proposals you will likely be assigned in your career as a data professional. Completing this activity will help prepare you for those career moments.

**The purpose** of this project is to conduct exploratory data analysis on a provided data set. Your mission is to perform further EDA on this data with the aim of learning more about the variables. 
  
**The goal** is to clean the data set and create a visualization.
<br/>  
*This activity has 4 parts:*

**Part 1:** Imports, links, and loading

**Part 2:** Data Exploration
*   Data cleaning


**Part 3:** Building visualizations

**Part 4:** Evaluate and share results

<br/> 
Follow the instructions and answer the questions below to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. 



<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## PACE: Plan 

In this stage, consider the following questions where applicable to complete your code response:
1. Identify any outliers: 


*   What methods are best for identifying outliers?
*   How do you make the decision to keep or exclude outliers from any future models?



### Task 1. Imports, links, and loading
Open Power BI Desktop, and keep it open as you proceed to the next steps.

Link to supporting materials:
[Power BI Desktop](https://powerbi.microsoft.com/en-us/desktop/)

For EDA of the data, import the data and packages that would be most helpful, such as pandas, numpy and matplotlib. 


In [1]:
# Import packages and libraries
#==> ENTER YOUR CODE HERE


**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load dataset into dataframe


<img src="images/Analyze.png" width="100" height="100" align=left>

## PACE: Analyze 

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### Task 2a. Data exploration and cleaning

Decide which columns are applicable

The first step is to assess your data. Check the data source page in Power BI to get a sense of the size, shape, and makeup of the data set. Then ask yourself these questions:

Given our scenario, which data columns are most applicable? 
Which data columns can I eliminate, knowing they won’t solve our problem scenario? 

Consider functions that help you understand and structure the data. 

*    head()
*    describe()
*    info()
*    groupby()
*    sort_values()
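
The exploration functions listed above can be tried on a toy DataFrame before running them on the real data. This is only a sketch: the values are made up, and the column name `VendorID` is an assumption about how the vendor column is labeled. Note that pandas sorts rows with `sort_values()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi dataset; values are made up,
# and "VendorID" is an assumed column name.
df = pd.DataFrame({
    "VendorID": [1, 2, 1, 2],
    "trip_distance": [1.2, 3.4, 0.8, 10.5],
    "total_amount": [8.5, 16.0, 6.3, 41.2],
})

print(df.head(2))      # first two rows
print(df.shape)        # (number of rows, number of columns)
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics for numeric columns

# Mean of the numeric columns for each vendor
mean_by_vendor = df.groupby("VendorID")[["trip_distance", "total_amount"]].mean()

# Rows ordered from longest to shortest trip
longest_first = df.sort_values("trip_distance", ascending=False)
```

Comparing the `describe()` maximum with the 75th percentile is often a quick first hint that a column contains outliers.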

What do you do about missing data (if any)? 

Are there data outliers? What are they and how might you handle them? 

What do the distributions of your variables tell you about the question you're asking or the problem you're trying to solve?




==> ENTER YOUR RESPONSE HERE

Start by discovering, using head and size. 

In [3]:
#==> ENTER YOUR CODE HERE


In [4]:
#==> ENTER YOUR CODE HERE


Use describe... 

In [5]:
#==> ENTER YOUR CODE HERE


And info. 

In [6]:
#==> ENTER YOUR CODE HERE


### Task 2b. Assess whether dimensions and measures are correct

On the data source page in Power BI, double-check the data types for the applicable columns you selected in the previous step. Pay close attention to the dimensions and measures to ensure they are correct.

In Python, consider the data types of the columns. *Consider:* Do they make sense? 

### Task 2c. Select visualization type(s)

Select data visualization types that will help you understand and explain the data.

Now that you know which data columns you’ll use, it is time to decide which data visualization makes the most sense for EDA of the ABC dataset. What type of data visualization(s) would be most helpful? 

* Line graph
* Bar chart
* Box plot
* Histogram
* Heat map
* Scatter plot
* A geographic map


As you'll see below, a bar chart, box plot and scatter plot will be most helpful in your understanding of this data.

A box plot will be helpful to determine outliers and where the bulk of the data points reside in terms of trip_distance, duration, and total_amount.

A scatter plot will be helpful to visualize the trends, patterns, and outliers of critical variables, such as trip_distance and total_amount.

A bar chart will help determine average number of trips per month, weekday, weekend, etc.

<img src="images/Construct.png" width="100" height="100" align=left>

## PACE: Construct 

Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

### Task 3. Data visualization

You’ve assessed your data, and decided on which data variables are most applicable. It’s time to plot your visualization(s)!


### Boxplots

Perform a check for outliers on relevant columns such as trip distance and trip duration. Remember, some of the best ways to identify the presence of outliers in data are box plots and histograms. 
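
As a numeric companion to the box plot, the sketch below flags outliers with the 1.5 × IQR rule, which is the default whisker rule in matplotlib and seaborn box plots. The distances are hypothetical, and the cutoff is one common convention rather than the required method for this activity.

```python
import pandas as pd

# Hypothetical trip distances in miles; the last value is deliberately extreme.
trip_distance = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 30.0])

q1 = trip_distance.quantile(0.25)
q3 = trip_distance.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = trip_distance[(trip_distance < lower) | (trip_distance > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```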

**Note:** Remember to convert your date columns to datetime in order to derive total trip duration.  
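
A minimal sketch of that conversion, assuming the pickup and drop-off timestamps arrive as strings. The column names `pickup_datetime` and `dropoff_datetime` and the sample timestamps are hypothetical; substitute the names used in your dataset.

```python
import pandas as pd

# Hypothetical column names and timestamps, stored as plain strings
df = pd.DataFrame({
    "pickup_datetime": ["2017-03-25 08:55:43", "2017-04-11 14:53:28"],
    "dropoff_datetime": ["2017-03-25 09:09:47", "2017-04-11 15:19:58"],
})

# Convert both string columns to datetime64
for col in ["pickup_datetime", "dropoff_datetime"]:
    df[col] = pd.to_datetime(df[col])

# Subtracting datetimes gives a Timedelta; express it in minutes
df["duration_min"] = (
    df["dropoff_datetime"] - df["pickup_datetime"]
).dt.total_seconds() / 60

print(df.dtypes)
print(df["duration_min"])
```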

In [7]:
# Convert date columns to datetime

#==> ENTER YOUR CODE HERE


**trip distance**

In [8]:
# Create box plot of trip_distance
#==> ENTER YOUR CODE HERE


In [9]:
# Create histogram of trip_distance

#==> ENTER YOUR CODE HERE


**total amount**

In [10]:
# Create box plot of total_amount
#==> ENTER YOUR CODE HERE

In [11]:
# Create histogram of total_amount
#==> ENTER YOUR CODE HERE


**tip amount**

In [12]:
# Create box plot of tip_amount
#==> ENTER YOUR CODE HERE


In [13]:
# Create histogram of tip_amount
#==> ENTER YOUR CODE HERE


**tip_amount by vendor**

In [14]:
# Create histogram of tip_amount by vendor
#==> ENTER YOUR CODE HERE

Next, zoom in on the upper end of the range of tips to check whether vendor one gets noticeably more of the most generous tips.

In [15]:
# Create histogram of tip_amount by vendor for tips > $10 
#==> ENTER YOUR CODE HERE

**Mean tips by passenger count**

Examine the unique values in the `passenger_count` column.
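
One possible way to inspect the values and then compute mean tips per group, shown on made-up data. The sketch uses the `passenger_count` and `tip_amount` column names from this notebook; the numbers are illustrative only.

```python
import pandas as pd

# Made-up rides; only the column names come from the notebook
df = pd.DataFrame({
    "passenger_count": [1, 1, 2, 0, 2, 1],
    "tip_amount": [2.0, 1.0, 3.0, 0.0, 1.0, 3.0],
})

# How many rides fall under each passenger count, ordered by count value
counts = df["passenger_count"].value_counts().sort_index()

# Mean tip for each passenger count
mean_tips = df.groupby("passenger_count")["tip_amount"].mean()

print(counts)
print(mean_tips)
```

Checking `counts` first matters because a mean computed from only a handful of rides (for example, a passenger count of 0) can be misleading in a bar plot.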

In [16]:
#==> ENTER YOUR CODE HERE

In [17]:
# Calculate mean tips by passenger_count
#==> ENTER YOUR CODE HERE

In [18]:
# Create bar plot for mean tips by passenger count
#==> ENTER YOUR CODE HERE


**Create month and day columns**

In [19]:
# Create a month column
#==> ENTER YOUR CODE HERE
# Create a day column
#==> ENTER YOUR CODE HERE


**Plot total ride count by month**

Begin by calculating total ride count by month.

In [20]:
# Get total number of rides for each month
#==> ENTER YOUR CODE HERE


Reorder the results to put the months in calendar order.
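
A sketch of the reordering step, assuming the month column holds English month names. `value_counts()` sorts by frequency, so `reindex()` with an explicit calendar list restores chronological order; the dates here are hypothetical.

```python
import pandas as pd

# Hypothetical pickup dates
pickup = pd.to_datetime(pd.Series([
    "2017-03-05", "2017-01-10", "2017-03-20",
    "2017-02-14", "2017-01-22", "2017-03-30",
]))

# Month names, then ride counts per month (sorted by frequency, not by calendar)
month = pickup.dt.month_name()
monthly_rides = month.value_counts()

month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Put months in calendar order; months with no rides get a count of 0
monthly_rides = monthly_rides.reindex(index=month_order, fill_value=0)
print(monthly_rides)
```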

In [21]:
# Reorder the monthly ride list so months go in order
#==> ENTER YOUR CODE HERE


In [22]:
# Show the index
#==> ENTER YOUR CODE HERE


In [23]:
# Create a bar plot of total rides per month
#==> ENTER YOUR CODE HERE


**Plot total ride count by day**

Repeat the above process, but now calculate the total rides by day of the week.

In [24]:
# Repeat the above process, this time for rides by day
#==> ENTER YOUR CODE HERE


In [25]:
# Create bar plot for ride count by day
#==> ENTER YOUR CODE HERE


**Plot total revenue by day of the week**

Repeat the above process, but now calculate the total revenue by day of the week.
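
For revenue, the same idea applies with a grouped sum instead of a count. In this sketch the column name `pickup_datetime` and all values are hypothetical; `total_amount` is the revenue column named in this notebook.

```python
import pandas as pd

# Hypothetical rides: two Mondays, one Tuesday, one Sunday
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2017-03-06", "2017-03-07", "2017-03-13", "2017-03-12"]
    ),
    "total_amount": [10.0, 20.0, 5.0, 8.0],
})

df["day"] = df["pickup_datetime"].dt.day_name()

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Sum revenue per weekday, then order Monday through Sunday
revenue_by_day = (
    df.groupby("day")["total_amount"].sum()
      .reindex(index=day_order, fill_value=0.0)
)
print(revenue_by_day)
```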

In [26]:
# Repeat the process, this time for total revenue by day
#==> ENTER YOUR CODE HERE


In [27]:
# Create bar plot of total revenue by day
#==> ENTER YOUR CODE HERE


**Plot total revenue by month**

In [28]:
# Repeat the process, this time for total revenue by month
#==> ENTER YOUR CODE HERE


In [29]:
# Create a bar plot of total revenue by month
#==> ENTER YOUR CODE HERE


#### Scatter plot

You can create a scatter plot in Power BI, which can be easier to manipulate and present. Those instructions create a scatter plot showing the relationship between total_amount and trip_distance. Consider adding the Power BI visualization to your executive summary, along with key insights from your findings on those two variables.

**Plot mean trip distance by drop-off location**

In [30]:
# Get number of unique drop-off location IDs
#==> ENTER YOUR CODE HERE


In [31]:
# Calculate the mean trip distance for each drop-off location
#==> ENTER YOUR CODE HERE

# Sort the results in ascending order by mean trip distance
#==> ENTER YOUR CODE HERE


In [32]:
# Create a bar plot of mean trip distances by drop-off location in ascending order by distance
#==> ENTER YOUR CODE HERE


## BONUS CONTENT

To confirm your conclusion, consider the following experiment:
1. Create a sample of coordinates from a normal distribution: in this case, 1,500 pairs of points from a normal distribution with a mean of 10 and a standard deviation of 5
2. Calculate the distance between each pair of coordinates 
3. Group the coordinates by endpoint and calculate the mean distance between that endpoint and all other points it was paired with
4. Plot the mean distance for each unique endpoint
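
The first three steps above might be sketched as follows, with plotting left to the cell below. Several details are assumptions: the seed is arbitrary, the coordinates are rounded to one decimal so that some endpoints can repeat, and the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# 1. 3,000 random 2D points (1,500 start/end pairs) from N(mean=10, sd=5).
#    The seed and the rounding to one decimal are assumptions of this sketch.
rng = np.random.default_rng(seed=0)
points = np.round(rng.normal(loc=10, scale=5, size=(3000, 2)), 1)
start, end = points[:1500], points[1500:]

# 2. Euclidean distance between each start point and its paired end point
distances = np.sqrt(((start - end) ** 2).sum(axis=1))

# 3. Treat each end point as a "drop-off location" and average the distances
pairs = pd.DataFrame({
    "end_x": end[:, 0],
    "end_y": end[:, 1],
    "distance": distances,
})
mean_by_endpoint = (
    pairs.groupby(["end_x", "end_y"])["distance"].mean().sort_values()
)

print(mean_by_endpoint.describe())
```

Sorting the means gives the series you would plot in step 4 to compare its shape with the drop-off location bar plot.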

In [33]:
#BONUS CONTENT

#1. Generate random points on a 2D plane from a normal distribution
#==> ENTER YOUR CODE HERE


# 2. Calculate Euclidean distances between points in first half and second half of array
#==> ENTER YOUR CODE HERE

# 3. Group the coordinates by "drop-off location", compute mean distance
#==> ENTER YOUR CODE HERE

# 4. Plot the mean distance between each endpoint ("drop-off location") and all points it connected to
#==> ENTER YOUR CODE HERE


**Histogram of rides by drop-off location**

First, check whether the drop-off location IDs are consecutively numbered. For instance, do they go 1, 2, 3, 4..., or are some numbers missing (e.g., 1, 3, 4...)? If the numbers aren't all consecutive, the histogram will look like some locations have very few or no rides, when in reality there's no bar because there's no location.
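
One way to run this check is to compare the IDs that appear in the data against the full integer range between the smallest and largest ID. The sketch uses a handful of made-up values for the `DOLocationID` column.

```python
import pandas as pd

# Made-up drop-off IDs with gaps at 2, 5, and 6
do_ids = pd.Series([1, 3, 4, 4, 7, 1], name="DOLocationID")

present = set(do_ids.unique().tolist())
expected = set(range(int(do_ids.min()), int(do_ids.max()) + 1))

# IDs inside the range that never occur in the data
missing = sorted(expected - present)

print(f"{len(missing)} missing IDs: {missing}")
```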

In [34]:
# Check if all drop-off locations are consecutively numbered
#==> ENTER YOUR CODE HERE



To eliminate the spaces in the histogram that these missing numbers would create, sort the unique drop-off location values, then convert them to strings. This will make the histplot function display all bars directly next to each other.
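
A sketch of the data preparation only, on made-up `DOLocationID` values. Sorting has to happen while the values are still numeric, because sorting strings would place "132" before "48".

```python
import pandas as pd

# Made-up drop-off IDs
do_ids = pd.Series([132, 7, 48, 7, 265, 48, 7], name="DOLocationID")

# Sort numerically first, then convert to strings
sorted_ids = do_ids.sort_values()
labels = sorted_ids.astype(str)

print(labels.tolist())
```

Passing `labels` to seaborn's `histplot` should then draw one bar per location in this sorted order, since string values are treated as categories rather than positions on a numeric axis.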

In [35]:

#==> ENTER YOUR CODE HERE
# DOLocationID column is numeric, so sort in ascending order
#==> ENTER YOUR CODE HERE

# Convert to string
#==> ENTER YOUR CODE HERE

# Plot


<img src="images/Execute.png" width="100" height="100" align=left>

## PACE: Execute 

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### Task 4a. Results and evaluation

Having built visualizations in Power BI and in Python, what have you learned about the dataset? What other questions have your visualizations uncovered that you should pursue?

***Pro tip:*** Put yourself in your client's position. What would they want to know?

Use the following code fields to pursue any additional EDA based on the visualizations you've already plotted. Also use the space to make sure your visualizations are clean, easily understandable, and accessible. 

***Ask yourself:*** Did you consider color, contrast, emphasis, and labeling?



==> ENTER YOUR RESPONSE HERE

I have learned .... 

My other questions are .... 

My client would likely want to know ... 

In [36]:
#==> ENTER YOUR CODE HERE


In [37]:
#==> ENTER YOUR CODE HERE


### Task 4b. Conclusion
*Make it professional and presentable*

You have visualized the data you need to share with the director now. Remember, the goal of a data visualization is for an audience member to glean the information on the chart in mere seconds.

*Questions to ask yourself for reflection:*
Why is it important to conduct Exploratory Data Analysis? Why are the data visualizations provided in this notebook useful?



EDA is important because ... 
==> 

Visualizations helped me understand ...
==>


You’ve now completed professional data visualizations according to a business need. Well done! 