# Data Science Project
## by Hannah Kwon

### Identifying and Defining
Choose your data scenario and define your purpose:                                           
I want to explore how crime in the United States has changed over time. For each year I will look at the population and the total, violent, property, murder, forcible rape, robbery, aggravated assault, burglary, larceny theft and vehicle theft figures, and compare them across the years. The dataset I will be using is ‘US crime rates 1960 - 2014’, which is publicly available as a .csv file; this is how I will access the data when needed.

Functional requirements:                                  
Data Loading:                                 
The program needs to load my data into a designated warehouse and handle minor errors while the data is being moved. The user inputs the dataset and the warehouse it should be moved into. The program outputs the dataset, now stored in the warehouse the user chose.  
Data Cleaning:  
The program needs to fix or remove incorrect, corrupted, incorrectly formatted, duplicate or incomplete data within a dataset, and handle any missing values. The user inputs the dataset that needs to be cleaned. The program outputs the cleaned dataset, which no longer contains incomplete or incorrect data.  
Data Analysis:  
The program needs to provide statistical information based on the data given and turn that data into information users can base decisions on. The user inputs the data they would like analyzed. The program outputs the resulting information.  
Data Visualisation:  
The program needs to present the user's data visually. The user inputs the data they would like to see as an image, graph or chart. The program outputs the data as a diagram, dataframe, chart or graph.  
Data Reporting:  
The program needs to collect and format the data into a final output and store it somewhere. The user inputs the data into the program. The program outputs the final product, which has gone through all of the steps of the program.  

Use Cases:                                 
Data Loading:                         
Actor: User               
Goal: To load a dataset into a designated warehouse  
Preconditions: User has a dataset file ready.                

Main Flow:                                            
1. User inputs dataset into the program/system.           
2. System validates the file format and clears any minor errors while data is being moved.                       
3. System loads the dataset into the designated warehouse.

Postconditions: Dataset is in the warehouse, ready for analysis.  
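
A minimal sketch of this loading flow in pandas. It assumes that "loading" means reading the CSV into a DataFrame and that checking for the expected columns is enough validation at this stage. The function name `load_dataset` and the `REQUIRED_COLUMNS` list are hypothetical, and the sample rows are copied from the first rows of the printed dataset so the sketch runs without the real file.

```python
import io

import pandas as pd

# Assumed subset of columns, taken from the printed DataFrame; the full file has more.
REQUIRED_COLUMNS = ["Year", "Population", "Total", "Violent", "Property"]


def load_dataset(source):
    """Read a CSV (path or file-like object) and check the expected columns exist."""
    try:
        df = pd.read_csv(source)
    except FileNotFoundError as err:
        raise FileNotFoundError(f"Dataset file not found: {source}") from err
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing expected columns: {missing}")
    return df


# In-memory sample standing in for 'US_Crime_Rates_1960_2014.csv'.
sample_csv = io.StringIO(
    "Year,Population,Total,Violent,Property\n"
    "1960,179323175,3384200,288460,3095700\n"
    "1961,182992000,3488000,289390,3198600\n"
)
crime_df = load_dataset(sample_csv)
print(crime_df.shape)
```

With the real file, the call would be `load_dataset('US_Crime_Rates_1960_2014.csv')`.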
                                                     
Data Cleaning:                                  
Actor: User                      
Goal: To fix or remove incorrect or incomplete data within a dataset                          
Preconditions: User has a dataset                       

Main Flow:
1. User inputs the dataset into the system/program.
2. System identifies the incorrect data.
3. System fixes or removes the incorrect data.

Postconditions: Dataset is cleaned and ready for analysis.  
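
One possible shape for the cleaning step, sketched in pandas. The rules here (coerce every column to numbers, keep the first record per year, drop rows with gaps) are assumptions rather than requirements from the project, and `clean_dataset` is a hypothetical name. The duplicated year and the `"n/a"` value in the input are made up to show the behaviour; they do not come from the real dataset.

```python
import pandas as pd


def clean_dataset(df):
    """Return a cleaned copy: numeric columns, no duplicate years, no incomplete rows."""
    cleaned = df.copy()
    # Turn anything that is not a number (e.g. "n/a") into NaN so it can be detected.
    for col in cleaned.columns:
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    # Keep the first record for each year, then drop rows that still have gaps.
    cleaned = cleaned.drop_duplicates(subset="Year", keep="first")
    cleaned = cleaned.dropna()
    return cleaned.astype("int64").reset_index(drop=True)


# Hypothetical messy input: one duplicated year and one corrupted value.
messy_df = pd.DataFrame({
    "Year": [1960, 1961, 1961, 1962],
    "Total": ["3384200", "3488000", "3488000", "n/a"],
})
clean_df = clean_dataset(messy_df)
print(clean_df)
```

Dropping incomplete rows is the simplest choice; filling gaps by interpolation would be an alternative if losing a whole year is not acceptable.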
                                                  
Data Analysis:                                                                          
Actor: User                                             
Goal: To provide statistical data based on the dataset                                             
Preconditions: Dataset is ready                                              

Main Flow:                      
1. User inputs data set into system                
2. System analyzes dataset                    
3. System outputs statistical data                   

Postconditions: Statistical information about the dataset is available to the user
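
A small sketch of what the analysis output could contain, assuming the statistics of interest are the mean, minimum, maximum, peak year and year-on-year percentage change of the `Total` column. The function name `summarise_total` is hypothetical; the five rows are the 1960 to 1964 values from the printed dataset.

```python
import pandas as pd

# First five rows of the printed dataset (Year and Total only).
crime_df = pd.DataFrame({
    "Year": [1960, 1961, 1962, 1963, 1964],
    "Total": [3384200, 3488000, 3752200, 4109500, 4564600],
})


def summarise_total(df):
    """Return basic statistics and the year-on-year percentage change in Total."""
    stats = {
        "mean": df["Total"].mean(),
        "min": df["Total"].min(),
        "max": df["Total"].max(),
        "peak_year": int(df.loc[df["Total"].idxmax(), "Year"]),
    }
    # pct_change compares each year with the one before it; the first year has no value.
    yearly_change = df.set_index("Year")["Total"].pct_change() * 100
    return stats, yearly_change.round(2)


stats, yearly_change = summarise_total(crime_df)
print(stats)
print(yearly_change)
```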
                                  
Data Visualisation:                                                                 
Actor: User                       
Goal: To turn the data into a graph/chart/diagram                           
Preconditions: Dataset is ready                      

Main Flow:
1. User inputs dataset                     
2. System decides which visual form would be best for the dataset                     
3. System outputs dataset in a visual form                   

Postconditions: The visual form is accurate                                
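
Step 2 of this flow (the system choosing a visual form) could be sketched as below. The rule in `choose_chart_kind` is a hypothetical rule of thumb, not a requirement from the project: values that never decrease, such as years, get a line chart, and anything else gets a bar chart. `plot_column` imports matplotlib inside the function, since pandas plotting raises an `ImportError` when matplotlib is not installed.

```python
import pandas as pd


def choose_chart_kind(df, x_col):
    """Pick a chart type with a simple, hypothetical rule of thumb."""
    # Values that never decrease (such as years) read well as a line; otherwise use bars.
    return "line" if df[x_col].is_monotonic_increasing else "bar"


def plot_column(df, x_col, y_col):
    """Plot y_col against x_col; requires matplotlib to be installed."""
    import matplotlib.pyplot as plt  # imported here so the rest of the sketch runs without it

    ax = df.plot(kind=choose_chart_kind(df, x_col), x=x_col, y=y_col,
                 title=f"{y_col} by {x_col}")
    ax.set_ylabel(y_col)
    plt.show()
    return ax


# First five rows of the printed dataset (Year and Total only).
crime_df = pd.DataFrame({
    "Year": [1960, 1961, 1962, 1963, 1964],
    "Total": [3384200, 3488000, 3752200, 4109500, 4564600],
})
print(choose_chart_kind(crime_df, "Year"))
```

Calling `plot_column(crime_df, "Year", "Total")` would then draw the chart once matplotlib is available.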
                                                 
Data Reporting:                                                          
Actor: User                              
Goal: To collect and format the data into a final output                               
Preconditions: Dataset is ready                              

Main Flow:                  
1. User inputs dataset                            
2. System collects and formats the data so that it is presentable                       
3. System outputs the formatted data                     

Postconditions: Dataset has gone through all the processes                            
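
As an illustration of the reporting flow, the sketch below formats a DataFrame as a plain-text report. The layout (title, underline, table, one summary sentence), the function name `build_report` and the report title are all assumptions; the rows are the 1960 to 1962 values from the printed dataset.

```python
import pandas as pd


def build_report(df, title):
    """Format the dataset as a plain-text report with a heading and a summary line."""
    first, last = df.iloc[0], df.iloc[-1]
    summary = (
        f"Total recorded crimes went from {int(first['Total']):,} in {int(first['Year'])} "
        f"to {int(last['Total']):,} in {int(last['Year'])}."
    )
    return "\n".join([title, "=" * len(title), df.to_string(index=False), "", summary])


crime_df = pd.DataFrame({
    "Year": [1960, 1961, 1962],
    "Total": [3384200, 3488000, 3752200],
})
report = build_report(crime_df, "US crime totals")  # hypothetical title
print(report)
```

The returned string could be saved with `pathlib.Path("report.txt").write_text(report)`, where the file name is only an example.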
                                                            
Non-functional Requirements:                                               
Usability:                                          
The ‘README’ document and the user interface must explain what the user is expected to do and how to do it. They should show where to find specific information and how to use it to achieve a given goal. The README also provides key information such as the project title, description, table of contents and requirements.  
Reliability:  
Avoiding ambiguity, and validating and verifying data, are the two main things required of the system when it reports errors to the user and ensures data integrity. Unambiguous error messages tell users why an error occurred and how they can address it. Data validation and verification protect data integrity and show users how reliable the data is.  
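
A hedged sketch of what unambiguous validation messages could look like. The specific checks (numeric columns, no negative counts, no repeated years, no empty cells) are assumptions about what "valid" means for this dataset, and `validate_dataset` is a hypothetical name. Each problem is reported as a sentence naming the column or years involved, so the user knows what to fix.

```python
import pandas as pd


def validate_dataset(df):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"Column '{col}' contains non-numeric values.")
        elif (df[col] < 0).any():
            problems.append(f"Column '{col}' contains negative values.")
    if "Year" in df.columns and df["Year"].duplicated().any():
        dupes = sorted(int(y) for y in df.loc[df["Year"].duplicated(), "Year"].unique())
        problems.append(f"Year appears more than once: {dupes}")
    missing_cells = int(df.isna().sum().sum())
    if missing_cells:
        problems.append(f"{missing_cells} cell(s) are empty.")
    return problems


# Rows from the printed dataset pass; the second frame is deliberately broken.
good_df = pd.DataFrame({"Year": [1960, 1961], "Total": [3384200, 3488000]})
bad_df = pd.DataFrame({"Year": [1960, 1960], "Total": [5, -1]})  # made-up values
print(validate_dataset(good_df))
print(validate_dataset(bad_df))
```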


### Researching and Planning
Research of chosen issue:                                     
Purpose:                                 
By looking at this dataset I am trying to see how the population and the total, violent, property, murder, forcible rape, robbery, aggravated assault, burglary, larceny theft and vehicle theft figures in the US have changed over the years. This matters for further research because it shows whether crime in the US is getting better or worse. It also gives a better understanding of the past, because earlier crime figures can be compared with recent ones to see the difference.  
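
Because the population grows over the period, comparing raw counts between years can be misleading. One common way to make years comparable is to express each count per 100,000 people. The sketch below assumes that approach; `add_rate_per_100k` is a hypothetical helper, and the rows are the 1960 to 1964 values from the printed dataset.

```python
import pandas as pd

# First five rows of the printed dataset (Year, Population and Total only).
crime_df = pd.DataFrame({
    "Year": [1960, 1961, 1962, 1963, 1964],
    "Population": [179323175, 182992000, 185771000, 188483000, 191141000],
    "Total": [3384200, 3488000, 3752200, 4109500, 4564600],
})


def add_rate_per_100k(df, count_col):
    """Return a copy with an extra column: count_col per 100,000 people."""
    out = df.copy()
    out[f"{count_col}_per_100k"] = (out[count_col] / out["Population"] * 100_000).round(1)
    return out


rates_df = add_rate_per_100k(crime_df, "Total")
print(rates_df[["Year", "Total_per_100k"]])
```
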
Missing data:                              
I have not found any missing data in the dataset so far. The one thing I would have liked the dataset to include is a breakdown by US state, rather than figures for the US as a whole.  
Stakeholders:  
Police departments across the US would benefit from my dataset, because comparing past and present crime rates shows them what they need to work on. The US as a whole would also benefit, because the dataset shows how crime rates have changed over the years.  
Use:  
The information I obtain from this data analysis is intended to help the US as a whole. It gives people an extra dataset to work with, so that they can calculate a more accurate average. Because the dataset includes the early 2000s, it can also help with estimating future crime rates in the US: earlier and more recent crime rates can be compared to project what the rates may be.  

Privacy and Security:              
Data privacy of source:                          
I am sourcing my data from a website called ‘Kaggle’. The source is responsible for protecting its participants' privacy and must not record any information that could harm the people the data describes. It is also responsible for protecting the data itself.  
Application data privacy:  
My responsibilities in maintaining user privacy for this data are to familiarise myself with internal privacy policies, processes and procedures, and to know who is responsible for privacy, use and disclosure. These responsibilities help make sure that all user information is kept safe and secure. If I were to release this application to the public, my responsibilities would be to maintain user privacy and to fix errors in the application.  
Cyber security:                              
To maintain cyber security once it is on the web, an application should be developed with security in mind, encrypt its data, apply authentication, role management and access control, and stay paranoid by treating every input as potentially malicious. User authentication is the process of verifying that someone who is trying to access services and applications is who they say they are. Password hashing turns a password into a fixed-length string of characters using a one-way function, so that a stolen stored value cannot easily be turned back into the password. Encryption protects data if it is stolen by turning it into a secret code that can only be unlocked with a special digital key.  
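
To make the password hashing idea concrete, here is a minimal sketch using only Python's standard library. It assumes PBKDF2 with SHA-256 and a random salt per password; the iteration count and the function names `hash_password` and `verify_password` are illustrative choices, not settings from this project.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor; higher is slower to compute and harder to attack


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = secrets.token_bytes(16)  # a fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, stored_digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored_digest))
print(verify_password("wrong guess", salt, stored_digest))
```

Only the salt and digest would be stored, never the password itself. Encryption is not shown here because it needs a third-party library.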

Data Dictionary:
![image.png](attachment:image.png)


In [2]:
import pandas as pd

original_df = pd.read_csv('US_Crime_Rates_1960_2014.csv')

print(original_df)

    Year  Population     Total  Violent  Property  Murder  Forcible_Rape  \
0   1960   179323175   3384200   288460   3095700    9110          17190   
1   1961   182992000   3488000   289390   3198600    8740          17220   
2   1962   185771000   3752200   301510   3450700    8530          17550   
3   1963   188483000   4109500   316970   3792500    8640          17650   
4   1964   191141000   4564600   364220   4200400    9360          21420   
5   1965   193526000   4739400   387390   4352000    9960          23410   
6   1966   195576000   5223500   430180   4793300   11040          25820   
7   1967   197457000   5903400   499930   5403500   12240          27620   
8   1968   199399000   6720200   595010   6125200   13800          31670   
9   1969   201385000   7410900   661870   6749000   14760          37170   
10  1970   203235298   8098000   738820   7359200   16000          37990   
11  1971   206212000   8588200   816500   7771700   17780          42260   
12  1972   2

In [22]:
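import pandas as pd

# pandas plotting needs matplotlib installed (e.g. pip install matplotlib);
# without it, .plot() raises the ImportError shown below this cell.
yearandtotal_df = pd.read_csv('US_Crime_Rates_1960_2014.csv')

def graph():
    """Plot the total number of crimes for each year as a line chart."""
    yearandtotal_df.plot(
                    kind='line',
                    x='Year',    # years along the horizontal axis
                    y='Total',   # total crimes on the vertical axis
                    color='blue',
                    alpha=0.3,
                    title='Amount of crime per year')

graph()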


ImportError: matplotlib is required for plotting when the default backend "matplotlib" is selected.