# DSCI 521: Data Analysis and Interpretation <br> Term Project Phase 1: Scoping an analytics project

## The big picture
Welcome to your term project! This is the first portion of a two-part, open-ended team assignment. It will culminate in a presentation during the last week of class or the regularly scheduled final exam period. 

Overall, this term project is intended to provide some open-ended experience with exploring datasets for possible associations, relationships, and predictive capacities. This will then be followed up with the execution of a more complex and intensive analysis that prototypes the function of a potential application or underpins an empirical finding. Specifically, all projects for this course will entail the following two phases:

1. A topically-motivated exploration of available pre-processed datasets, i.e., exploratory data analyses. 
2. Informed by the outcome of phase (1), the selection and execution of a more in-depth analytics project that prototypes an application's function or conducts an empirical investigation.

The first report on your team's project will summarize any visual and quantitative exploratory findings from (1) and outline and motivate a course for (2). So this first phase should include both a discussion of the __availability and scale of the project's target data__ and an exploratory analysis and discussion of __why it may be possible to conduct the target analysis__, including any relevant pre-built technologies and tools that will make it possible, and how. Additionally, this initial project planning report should speculate on and provide examples of potential analytic outcomes and how we might interpret and build from them, whether towards academic or commercial outcomes.

__Note__: All reports should include a high-level abstract/discussion written in a tone suited to a broad, diverse audience.

At the end of the term, your group will provide a final report that recaps progress on the task your group has come up with, specifically revisiting what you _thought_ the results of the in-depth analysis would be, as compared to the actual work involved, obstacles encountered, and results obtained.

__Important__: because your project reports will have discussion intermingled with data and code as output, I not only request the submission of your work in Jupyter Notebook format, but additionally recommend conducting your work collaboratively as a group in Jupyter notebooks.

## This is only a guideline

While this document provides some idea of structure and expectation for your project, it is important to note that this is an intentionally open-ended project. Hence, no specific rubric is provided. The courses of different projects will require overcoming different obstacles, and success in a data science project is ultimately a (partial) function of a team's ability to adapt to project needs. However, all work should be well documented, articulately presented, and justified. If at any point it is unclear what to do or how to represent your project's work, please do not hesitate to ask your instructor for direction.

## Your team

The first thing you'll have to do in this phase is organize into a project team. Data science is often conducted in teams, with different team members covering the diversity of knowledge and skills relevant to the different areas that a project must support to succeed. Be sure to consider your teammates' strengths and their interests in gaining experience with analytics&mdash;if you want extensive experience with network modeling, pitch a project about this with a few other points of interest. It will help to discuss interests. Be sure to write out the names of the project team's members in your first report and answer these two questions for each member:

1. What areas/skills/domains does the team member presently identify with?
2. Into which areas/skills/domains would the team member like to grow?

## Your topic
The course of your project will be determined by a few things:

1. the motivations present in your project's team,
2. the availability of relevant pre-processed data,
3. an exploratory motivation for in-depth analysis, and
4. synergies with appropriate tools and quantitative frameworks.

Thus, choosing your topic is closely tied to your team, the data you are able to identify, and the trends/tools you are able to utilize. To start, discuss the domain interests present on your project team. To get you on your way, let's start with two questions:

1. Is there an aspect of the IoT, natural world, society, literature, or art, etc. that you would like to investigate computationally through some available, pre-processed, and hopefully relevant data?

2. What sort of analytic tools and data medium are you interested in working with?

Whatever direction you set for your project, please make sure you document it well, keeping track of how its objectives and strategies change as you encounter available materials and other existing work.

## What you're responsible for in this phase
So here's the goal for phase 1. You must:

> Conduct an extensive exploratory data analysis focused on an identified topic and potentially relevant, available datasets, addressing their capacities for a more in-depth analysis. 

This phase of the project will set expectations and a work plan for your project's in-depth analysis. Not only should you scope the utility of available datasets, but you should also identify potential avenues of continued study towards either an analytically-backed application or empirical finding.

For many DSCI 521 projects, a source of project data may be pre-processed materials derived from a previous DSCI 511 term project, as those are focused on dataset construction. If this is the case, please clearly indicate the work that this project builds from, in addition to the final state the dataset was in when your group resumed work. No matter what, this project's deliverables are the same&mdash;regardless of dataset sources!

### Phase 1 report checklist
Here's a checklist of items that you _absolutely_ should include:

- [ ] a background report on the team's members, their self-identified skills, and individual contributions
- [ ] a discussion of what you would like your analysis to do, and who/what it will support
- [ ] an exhibition of analyses from dataset(s) explored, including visual analyses, captions, and useful descriptions
- [ ] a discussion of who might be interested in your analysis
- [ ] a discussion of how your analysis might fit into an application or investigation
- [ ] a discussion of how your analysis is limited and could be improved
- [ ] a selection of data for continued analysis, including justification
- [ ] a discussion of how your analysis might be completed and disseminated, i.e., who's the target audience?

Additionally, by the end of the term your final report should include items like

- [ ] a README.md that describes what is present in the project analysis and how it may be repeated
- [ ] code that documents your analysis&mdash;your instructor should be able to re-run the analysis!
- [ ] tables, figures, and discussion supporting the analysis' interpretation

_Note_: These are not exhaustive lists of topics or tasks worth covering in your project. In general, if there's something interesting about your project, whether relating to the source data's construction, existence, or novelty of the targeted tools and applications, or _anything else_, then be sure to document it!

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
acci = pd.read_csv('US_Accidents_Dec19.csv')

In [7]:
acci

Unnamed: 0,ID,Source,TMC,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,MapQuest,201.0,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,...,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,MapQuest,201.0,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,...,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,MapQuest,201.0,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,...,False,False,False,False,True,False,Night,Night,Day,Day
3,A-4,MapQuest,201.0,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,...,False,False,False,False,False,False,Night,Day,Day,Day
4,A-5,MapQuest,201.0,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,...,False,False,False,False,True,False,Day,Day,Day,Day
5,A-6,MapQuest,201.0,3,2016-02-08 07:44:26,2016-02-08 08:14:26,40.100590,-82.925194,,,...,False,False,False,False,False,False,Day,Day,Day,Day
6,A-7,MapQuest,201.0,2,2016-02-08 07:59:35,2016-02-08 08:29:35,39.758274,-84.230507,,,...,False,False,False,False,False,False,Day,Day,Day,Day
7,A-8,MapQuest,201.0,3,2016-02-08 07:59:58,2016-02-08 08:29:58,39.770382,-84.194901,,,...,False,False,False,False,False,False,Day,Day,Day,Day
8,A-9,MapQuest,201.0,2,2016-02-08 08:00:40,2016-02-08 08:30:40,39.778061,-84.172005,,,...,False,False,False,False,False,False,Day,Day,Day,Day
9,A-10,MapQuest,201.0,3,2016-02-08 08:10:04,2016-02-08 08:40:04,40.100590,-82.925194,,,...,False,False,False,False,False,False,Day,Day,Day,Day


In [8]:
The dataset consists of traffic accidents recorded between February 2016 and December 2019 across all the states in the US. This analysis aims to help identify the major causes of road accidents in the United States. We would like to analyze and interpret these accident data to find accident-prone locations and casualties, and to see how weather or environment affects accident occurrence. It will support estimating causes for accident prediction.

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import matplotlib.pyplot as plt

acci = pd.read_csv("/content/drive/My Drive/DSCI521/US_Accidents_Dec19.csv")
print('Shape of the Dataset')
print(acci.shape)
print('\n\n')
print('Variables in the Dataset')
print(acci.columns)
print('\n\n')
print('NA values count')
# Wrap in print() so the counts are shown even though this is not the cell's last statement.
print(acci.isna().sum())
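The raw NA counts above can be hard to judge without the table size next to them. As a minimal sketch, the helper below reports counts and percentages together. The `missing_summary` name and the tiny `toy` table are assumptions made for illustration and do not come from the accidents file.

```python
import pandas as pd

# Toy stand-in for the accidents table; the values are made up for illustration.
toy = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3", "A-4"],
    "End_Lat": [None, None, 39.1, None],
    "Severity": [3, 2, 2, 3],
})

def missing_summary(df):
    """Return per-column NA counts and percentages, worst columns first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

na_table = missing_summary(toy)
print(na_table)
```

On the real data, the same call would be `missing_summary(acci)`; columns that are mostly empty may be candidates to drop before the in-depth analysis.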


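# Extract the calendar year of each accident from its start timestamp.
acci['year'] = pd.to_datetime(acci['Start_Time']).dt.year

# Count accidents per year in chronological order. Reading the Series index and
# values directly avoids relying on the column names that reset_index() produces,
# which differ between pandas versions.
swma = acci['year'].value_counts().sort_index()
swma = swma[(swma.index >= 2016) & (swma.index < 2020)]
years = swma.index.astype(int).astype(str)
plt.scatter(years, swma.values)
plt.plot(years, swma.values)
plt.title('Accidents - Year wise')
plt.xlabel('Year')
plt.ylabel('No. of Accidents')
plt.show()
print('The highest number of accidents was recorded in 2019')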


# Plot the five states with the most accidents in each year.
# 'year' holds integers, so compare against numbers rather than strings.
for yr in [2019, 2018, 2017, 2016]:
    top5 = acci.loc[acci['year'] == yr, 'State'].value_counts().head(5)
    plt.scatter(top5.index, top5.values, label=str(yr))
plt.title('Top 5 States with high accidents')
plt.xlabel('States')
plt.ylabel('No. of Accidents')
plt.legend()
plt.show()
print(' Most accidents happened in California, and 2019 recorded the highest count. \n South Carolina recorded almost the same number of accidents in all years.\n Texas, Florida and New York had low numbers in 2016 but were high in other years.\n\n\n')
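The per-year comparison above can also be expressed as a single state-by-year table, which may be easier to read than overlaid scatter points. This is a minimal sketch on made-up rows, assuming only the `State` and `Start_Time` column names from the dataset; the `toy` table and the choice of two top states are illustrative.

```python
import pandas as pd

# Made-up rows that mimic the State and Start_Time columns of the accidents table.
rows = [
    ("CA", "2016-03-01 08:00:00"), ("CA", "2017-05-02 09:00:00"),
    ("TX", "2017-06-03 10:00:00"), ("CA", "2018-01-04 11:00:00"),
    ("TX", "2018-02-05 12:00:00"), ("FL", "2018-07-06 13:00:00"),
    ("CA", "2019-08-07 14:00:00"), ("FL", "2019-09-08 15:00:00"),
    ("TX", "2019-10-09 16:00:00"),
]
toy = pd.DataFrame(rows, columns=["State", "Start_Time"])
toy["year"] = pd.to_datetime(toy["Start_Time"]).dt.year

# One row per state, one column per year, cells are accident counts.
counts = pd.crosstab(toy["State"], toy["year"])

# Keep the states with the largest totals across all years.
top_states = counts.sum(axis=1).nlargest(2).index
top_counts = counts.loc[top_states]
print(top_counts)
```

A table like `top_counts` keeps the same set of states in every year, so a state that drops out of one year's top five still shows up with its actual count.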



# Count accidents by reported visibility for Texas, matching the plot title
# and the conclusion printed below.
txvis = acci.query('State == "TX"')['Visibility(mi)'].value_counts().head(7)
plt.barh(txvis.index.astype(str), txvis.values)
plt.gca().invert_yaxis()  # most frequent visibility value at the top
plt.title('Visibility during accidents in Texas')
plt.ylabel('Visibility (mi)')
plt.xlabel('No. of Accidents')
plt.show()
print('A high number of accidents happened in Texas even when visibility was clear for 10 miles\n\n\n')

# Count accidents by weather condition for New York.
nyac = acci.query('State == "NY"')['Weather_Condition'].value_counts().head(7)
plt.barh(nyac.index, nyac.values)
plt.gca().invert_yaxis()  # most frequent condition at the top
plt.title('Top weather conditions for accidents in New York')
plt.ylabel('Weather condition')
plt.xlabel('No. of Accidents')
plt.show()
print('A high number of accidents happened in New York when it was cloudy\n\n\n')
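Raw counts like those plotted above depend on how many accidents each state has in total, which makes states hard to compare. One possible refinement, sketched here on made-up rows, is to report shares instead. The `top_shares` helper and the `toy` table are assumptions for illustration; only the `State` and `Weather_Condition` column names come from the dataset.

```python
import pandas as pd

# Made-up rows that mimic the State and Weather_Condition columns.
toy = pd.DataFrame({
    "State": ["NY"] * 5 + ["TX"] * 3,
    "Weather_Condition": ["Cloudy", "Cloudy", "Clear", "Rain", "Cloudy",
                          "Clear", "Clear", "Rain"],
})

def top_shares(df, state, column, k=3):
    """Return the k most common values of `column` in `state` as fractions."""
    subset = df.loc[df["State"] == state, column]
    return subset.value_counts(normalize=True).head(k)

ny_shares = top_shares(toy, "NY", "Weather_Condition")
print(ny_shares)
```

A share still does not say whether cloudy weather makes accidents more likely, since it may simply be the most common weather; that question would need data on how often each condition occurs.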



In [None]:
### Applications of analysis

The objective of this analysis is to understand the underlying causes of accidents in the US. It will also help find patterns in factors such as weather conditions, latitude and longitude, and location with respect to the occurrence of accidents.

### Entities who might be interested in the analysis

- Governments can use this data to warn citizens about accident-prone areas.
- Construction agencies can use this information to improve road conditions and reduce accidents.

### Practical use of the data analysis

The analysis of this data is useful in applications that focus on displaying real-time accident information. It could also be embedded into Google/Apple Maps as a new feature. Any investigation into the reasons for accidents in the US can make use of this information.

### Limitations

- This analysis does not give real-time information.

### Improvements

- It can be improved by including more data.