# Data Analysis Overview
As we move through this course, we will learn how to piece together a standard data analysis. Let's begin with an overview of the process so that you can keep in mind how to put all of this together for your final project.

## Step 1: Locating data
The internet contains a plethora of datasets, but knowing where to find reliable data can be trickier. Here are a few reputable sources for your project:

### Data Sources
- [Kaggle](https://www.kaggle.com/): A great resource for datasets, contests and even courses
- [Scikit Learn Built in Datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets): The popular data science package comes with a good set of built-in datasets to practice on
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php): More curated, industry-standard datasets that are great for classification and machine learning problems
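
As a quick illustration of the built-in datasets mentioned above, here is a minimal sketch that loads one of them into a pandas DataFrame. It assumes scikit-learn (version 0.23 or later, for the `as_frame` option) and pandas are installed; the choice of the iris dataset is only an example.

```python
from sklearn.datasets import load_iris

# Load the classic iris dataset as a pandas DataFrame.
# With as_frame=True, the returned object exposes a .frame attribute
# holding the feature columns plus a "target" column.
iris = load_iris(as_frame=True)
df = iris.frame

print(df.shape)   # (rows, columns)
print(df.head())  # first five rows
```

Datasets from Kaggle or the UCI repository usually arrive as files instead, which you would typically read with `pandas.read_csv`.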

### Reliable Data
Once you have located a dataset, you must determine if the data comes from a reliable source. Here are a few things to keep in mind: 
- Get data directly from the source: check the references, and if the dataset points to a publication or institution (e.g., Google or the University of Washington) that created the dataset, get the data from that source
- Get current data: outdated datasets are okay for practice, but in the real world, make sure that the business decisions you will be making are based upon the latest trends
- Make sure your dataset contains enough information: you will want to make sure you can answer all of your questions with this dataset, and that there are enough representative samples to make an informed decision

[Here's a good article with more information.](https://blog.hubspot.com/marketing/find-good-data)

## Step 2: Preparing your data (Data Munging)
![data_wrangling](https://www.promptcloud.com/wp-content/uploads/2018/05/data-wrangling-how-to-do-effectively.jpg)

You will hear that a data analyst/data scientist spends 80% of their time making data usable. It's true, and it includes such things as:
- Fixing formatting errors and misspelled words
- Restructuring and removing redundancies
- Flattening and grouping data
- Fixing errors
- Locating missing entries
- Normalization
- Binning

Basically, getting your data into a format that you can work with. 
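
To make a few of these tasks concrete, here is a minimal sketch using pandas. The DataFrame, its column names, and the bin edges are all hypothetical, made up purely for illustration; real datasets will need their own rules.

```python
import pandas as pd

# Hypothetical raw data with typical problems:
# inconsistent text formatting, a duplicate row, and a missing value
raw = pd.DataFrame({
    "city": [" seattle", "Seattle ", "SEATTLE", "Tacoma", "Tacoma"],
    "age": [25, 25, 41, None, 67],
    "income": [40000.0, 40000.0, 85000.0, 52000.0, 61000.0],
})

# Fix formatting errors: trim whitespace and standardize capitalization
clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()

# Remove redundancies: drop rows that are now exact duplicates
clean = clean.drop_duplicates().reset_index(drop=True)

# Locate missing entries: count missing values per column
missing_counts = clean.isna().sum()

# Normalization: rescale income to the 0-1 range (min-max scaling)
income_range = clean["income"].max() - clean["income"].min()
clean["income_scaled"] = (clean["income"] - clean["income"].min()) / income_range

# Binning: group ages into labeled ranges (bin edges are arbitrary here)
clean["age_group"] = pd.cut(
    clean["age"], bins=[0, 30, 60, 120], labels=["young", "middle", "senior"]
)

# Grouping: average income per city
income_by_city = clean.groupby("city")["income"].mean()

print(missing_counts)
print(clean)
print(income_by_city)
```

Note that the missing age is only located here, not filled in or dropped; how to handle it depends on the question you are asking.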

## Step 3: Exploratory Data Analysis (EDA)
Once you've cleaned up and prepared your data, you can start taking a deeper look at what's in your dataset. This is a very important step that allows you to start thinking about what questions you can ask and the right statistical models to try.
- View summary statistics such as mean, quantiles, variance, standard deviation
- Determine the type of distribution (normal, bimodal, skewed)
- Examine data types (numerical, continuous, categorical)
- Deal with missing values
- Determine potential relationships between your variables
- Identify and isolate outliers
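
The checks listed above can be sketched in a few lines of pandas. The dataset below is hypothetical (the column names and values are invented, and one price is deliberately extreme), and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical dataset; the value 300 in "price" is a deliberate outlier
df = pd.DataFrame({
    "sqft": [800, 950, 1100, 1200, 1350, 1500, 1650, 1800],
    "price": [100, 115, 130, 140, 155, 170, 185, 300],
    "type": ["condo", "condo", "house", "house", "house", "condo", "house", "house"],
})

# Summary statistics for numeric columns: count, mean, std, quartiles
summary = df.describe()

# Data types and missing values per column
dtypes = df.dtypes
missing = df.isna().sum()

# Shape of the distribution: skewness well above 0 suggests a long right tail
price_skew = df["price"].skew()

# Potential relationships: pairwise correlation between numeric columns
corr = df[["sqft", "price"]].corr()

# Outliers: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

print(summary)
print(dtypes)
print(f"skewness of price: {price_skew:.2f}")
print(corr)
print(outliers)
```

In practice you would pair these numbers with plots (histograms, box plots, scatterplots), since a single statistic can hide the shape of the data.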

## Step 4: Selecting a Statistical Model
During EDA, you'll get a better feel for your data that helps you identify which model to choose. 

![image.png](attachment:image.png)

For example: 
- To examine the relationship between continuous variables, use linear regression
- To predict a binary/discrete outcome (cat/dog, yes/no), use logistic regression


## Step 5: Check Model Assumptions
Each statistical model you choose has mathematical and logical requirements, and we must check that they hold before we apply the model. For example, linear regression assumes that the relationship between the independent and dependent variables is linear. We can check this by:
- Visualizing our distributions
- Checking for independence of variables
- Plotting relationships, such as in a scatterplot

Each model has its own assumptions, and you will need to learn the methods for checking those assumptions before continuing with your model of choice.
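
As one illustration of checking the linearity assumption numerically, the sketch below fits a straight line with NumPy and looks for leftover curvature in the residuals. The data is simulated, and the "fit a quadratic to the residuals" check is an informal heuristic chosen for this example, not a formal statistical test; a residual plot is the more usual tool.

```python
import numpy as np

# Simulated data (hypothetical): y depends linearly on x, plus random noise
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 50)
y_linear = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

# A second, clearly non-linear relationship for comparison
y_curved = x ** 2


def residual_curvature(x, y):
    """Fit a line, then measure how much curvature remains in the residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    # Leading coefficient of a quadratic fitted to the residuals:
    # close to zero when a straight line already captures the trend
    return np.polyfit(x, residuals, deg=2)[0]


curv_linear = residual_curvature(x, y_linear)
curv_curved = residual_curvature(x, y_curved)

print(f"curvature left over (linear data): {curv_linear:.3f}")
print(f"curvature left over (curved data): {curv_curved:.3f}")
```

A value near zero for the first dataset and a clearly non-zero value for the second suggests that a straight line is reasonable for the first relationship only.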

## Step 6: Build the Model
After all this hard work, we can finally run our dataset through our model(s) of choice. This means we will finally implement that linear regression or cluster our dataset, etc. 
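
For instance, fitting a simple linear regression can be sketched with NumPy's least-squares polynomial fit. The data values and the `predict` helper below are hypothetical and exist only for illustration; later in your work you may prefer a dedicated modeling library.

```python
import numpy as np

# Hypothetical data: one predictor (x) and one response (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)


def predict(new_x):
    """Apply the fitted line to new values of x (illustrative helper)."""
    return slope * np.asarray(new_x, dtype=float) + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(predict([6.0, 7.0]))
```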

## Step 7: Evaluate the Model
Once we run our data through our statistical model, we must evaluate its performance. Each statistical model comes with its own metrics for evaluating the accuracy of the model, such as:
- Variance
- Confidence intervals
- Mean Squared Error
- Ordinary Least Squares
- AUC/ROC
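
To show what two of these metrics actually compute, here is a minimal sketch of mean squared error (for a regression model) and AUC (for a binary classifier), written out by hand with NumPy. The function names and the sample numbers are hypothetical; in practice you would normally use a library implementation.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example gets a higher score than a randomly chosen
    negative example (ties count as half)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


# Regression example: errors are 1, 0, and 2, so MSE = (1 + 0 + 4) / 3
mse = mean_squared_error([3, 5, 7], [2, 5, 9])

# Classification example: 3 of the 4 positive/negative pairs are ranked correctly
auc_value = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(f"MSE: {mse:.3f}")
print(f"AUC: {auc_value:.2f}")
```

Lower MSE means predictions are closer to the observed values, while an AUC of 0.5 is no better than random ranking and 1.0 is a perfect ranking.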

## Step 8: Interpretation and Conclusion
And finally, what do we make of our analysis? Explain the relationships found and the assumptions made, and support your interpretation of the results with evidence from your analysis. If there are issues, assumptions, or areas that need further exploration, state those here as well.