# `pandas` Part 5: Finding and Replacing Values

# Learning Objectives
## By the end of this tutorial you will be able to:
1. Check datatypes with `dtype`
2. Find and replace missing (null) values with `fillna()`
 

## Files Needed for this lesson: `winemag-data-130k-v2.csv`
>- Download this CSV file from Canvas prior to the lesson

## The general steps to working with pandas:
1. `import pandas as pd`
2. Create or load data into a pandas DataFrame or Series
3. Read data with the `pd.read_*` functions
>- Excel files: `pd.read_excel('fileName.xlsx')`
>- CSV files: `pd.read_csv('fileName.csv')`
>- Note: if the file you want to read into your notebook is not in the same folder you can do one of two things:
>>- Move the file you want to read into the same folder/directory as the notebook
>>- Type out the full path into the read function
4. After steps 1-3 you will want to check out your DataFrame
>- Use `shape` to see how many records and columns are in your DataFrame
>- Use `head()` to show the first few records in your DataFrame (5 by default; pass a number, e.g. `head(10)`, to see more)

Narration Video: https://youtu.be/miAjPCXzFQs

# Analytics Project Framework Notes
## A complete and thorough analytics project will have 3 main areas
1. Descriptive Analytics: tells us what has happened or what is happening. 
>- The focus of this lesson is how to do this in Python.
>- Many companies are at this level but not much more than this
>- Descriptive statistics (mean, median, mode, frequencies)
>- Graphical analysis (bar charts, pie charts, histograms, box-plots, etc)
2. Predictive Analytics: tells us what is likely to happen next
>- Fewer companies are at this level, but more are slowly getting there
>- Predictive statistics ("machine learning (ML)" using regression, multi-way frequency analysis, etc)
>- Graphical analysis (scatter plots with regression lines, decision trees, etc)
3. Prescriptive Analytics: tells us what to do based on the analysis
>- Synthesis and Report writing: executive summaries, data-based decision making
>- No analysis is complete without a written report with at least an executive summary
>- Communicate results of analysis to both non-technical and technical audiences

# Descriptive Analytics Using `pandas`

# Initial set-up steps
1. import modules and check working directory
2. Read data in
3. Check the data
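
As a minimal sketch of step 1, the cell below imports the modules and prints the current working directory. Using the standard-library `os` module for this is an assumption; the original lesson does not say which tool it used.

```python
import os
import pandas as pd

# Show the folder that relative file names are resolved against
working_directory = os.getcwd()
print(working_directory)

# Printing the version is a quick way to confirm pandas imported correctly
print(pd.__version__)
```

If the CSV file is not in the folder printed here, either move it there or pass the full path to `read_csv()`.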

# Step 2 Read Data Into a DataFrame with `read_csv()`
>- file name: `winemag-data-130k-v2.csv`
>- Set the index to column 0
>- Note: I'm using the full path name on my laptop because I have the file in a different folder than the notebook for this lesson
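
A rough sketch of this step follows. So that it runs without the download, it reads a tiny in-memory stand-in for the file; the three rows and the `csv_text` name are made up for illustration. With the real file you would pass the file name (or full path) instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for winemag-data-130k-v2.csv.
# The unnamed first column plays the role of the saved row index.
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,87,14.0
"""

# index_col=0 tells pandas to use column 0 as the index instead of
# creating a new one. With the real file this would look like:
#   wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
wine_reviews = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(wine_reviews)
```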

### Check how many rows, columns, and data points are in the `wine_reviews` DataFrame
>- Use `shape` and indices to define variables
>- We can store the values for rows and columns in variables if we want to access them later
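
One way to do this is sketched below on a small made-up DataFrame (the rows are hypothetical; the real `wine_reviews` comes from `read_csv()`). `shape` is a tuple of `(rows, columns)`, so indexing it gives each count.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US', 'US'],
    'points': [87, 87, 87, 90],
    'price': [float('nan'), 15.0, 14.0, 65.0],
})

rows = wine_reviews.shape[0]      # number of records
columns = wine_reviews.shape[1]   # number of fields
data_points = rows * columns      # total cells, including missing ones
print(rows, columns, data_points)
```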

### Check a couple of rows of data
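
A short sketch using `head()` on a made-up DataFrame (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US', 'US'],
    'points': [87, 87, 87, 90],
    'price': [float('nan'), 15.0, 14.0, 65.0],
})

# head(2) returns the first two rows; head() with no argument returns five
first_rows = wine_reviews.head(2)
print(first_rows)
```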

### Another step in understanding the data you are working with is checking the data types
>- The analysis will differ depending on the data type
>>- For example, only number fields can be averaged
>>- Text/string analysis usually involves counts/frequencies 

### Checking datatypes with `dtype` and `dtypes`
>- General syntax for `dtype`: dataFrame.field.dtype
>>- Returns the datatype for one field
>- General syntax for `dtypes`: dataFrame.dtypes
>>- Returns the datatypes for all the fields in a dataframe

### Check one field with `dtype`

### Check all the fields in the data frame with `dtypes`
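
Both checks can be sketched on a small made-up DataFrame (the rows are hypothetical). How text columns are reported can vary between pandas versions, so the comments only describe the numeric fields.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US', 'US'],
    'points': [87, 87, 87, 90],
    'price': [float('nan'), 15.0, 14.0, 65.0],
})

# One field: an integer dtype for points
points_dtype = wine_reviews.points.dtype
print(points_dtype)

# All fields: a Series mapping each column name to its dtype
all_dtypes = wine_reviews.dtypes
print(all_dtypes)
```

Note that `price` is stored as a float because it contains a missing value.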

### Question: What is the average price of all wines? 
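
A minimal sketch using `mean()` on a made-up `price` column (the values are hypothetical). By default `mean()` skips missing values rather than treating them as zero.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'points': [87, 87, 87, 90],
    'price': [float('nan'), 15.0, 14.0, 65.0],
})

# Average of the three non-missing prices
average_price = wine_reviews.price.mean()
print(average_price)
```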

### Question: How many wines are there per country in the data frame? 
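
One common approach is `value_counts()`, sketched here on made-up rows (the data is hypothetical). It returns the number of records per distinct value, sorted from most to least frequent, and leaves out missing values by default.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'France', 'US', 'Italy'],
})

wines_per_country = wine_reviews.country.value_counts()
print(wines_per_country)
```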

##### Another way to get wines by country using `groupby`: 
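
A sketch of the `groupby` version on the same made-up rows (the data is hypothetical). Using `size()` to count the rows in each group is one option; the original cell may have used a different aggregation.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'France', 'US', 'Italy'],
})

# Group the rows by country, then count the rows in each group
wines_per_country = wine_reviews.groupby('country').size()
print(wines_per_country)
```

Unlike `value_counts()`, the result is ordered by country name rather than by count.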

## What are the descriptive analytics for wine price?
>- Include the 10th and 90th percentiles of wines in the analysis
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
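
A minimal sketch using the `percentiles` parameter of `describe()` on made-up prices (the values are hypothetical). The median is listed explicitly alongside the 10th and 90th percentiles.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'price': [float('nan'), 15.0, 14.0, 65.0, 20.0],
})

# Percentiles are given as fractions between 0 and 1
price_stats = wine_reviews.price.describe(percentiles=[0.1, 0.5, 0.9])
print(price_stats)
```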

## What are the descriptive analytics for country?  
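
Calling `describe()` on a text field gives a different summary than on a numeric one: count, number of unique values, the most frequent value, and its frequency. A sketch on made-up rows (the data is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'France', 'US', 'Italy'],
})

country_stats = wine_reviews.country.describe()
print(country_stats)
```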

## What are the descriptive analytics for all numerical fields in the data frame? 
>- Note: By default `describe()` summarizes only the numerical fields when called on a DataFrame.
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
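
A short sketch on a made-up DataFrame (the rows are hypothetical). The text column `country` is left out of the result automatically.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US', 'US'],
    'points': [87, 87, 87, 90],
    'price': [float('nan'), 15.0, 14.0, 65.0],
})

numeric_summary = wine_reviews.describe()
print(numeric_summary)
```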

#### Question: Why would points and price have different count values? 

## What are the descriptive analytics for all non-numeric fields in the DataFrame? 
>- Note: we can use `select_dtypes` with the parameter `include='object'` to only include string fields.
>>- `select_dtypes(include='object')`
>- Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html#pandas.DataFrame.select_dtypes
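
A sketch of this pattern on made-up rows (the data and the `Taster A` name are hypothetical). The text columns are created with `dtype=object` explicitly so the example does not depend on how a particular pandas version infers string columns; that choice is an assumption for the sketch, not something the lesson specifies.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': pd.Series(['US', 'Italy', 'US', None], dtype=object),
    'taster_name': pd.Series(["Kerin O'Keefe", 'Taster A', None, None],
                             dtype=object),
    'points': [87, 88, 90, 86],
})

# Keep only the object (text) columns, then summarize them
text_fields = wine_reviews.select_dtypes(include='object')
text_summary = text_fields.describe()
print(text_summary)
```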


## Finally, to include every field in the data frame:
>- Use `describe(include='all')`
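
A minimal sketch on a made-up DataFrame (the rows are hypothetical). Statistics that do not apply to a field, such as the mean of a text column, show up as missing values in the result.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US', 'US'],
    'points': [87, 87, 87, 90],
    'price': [float('nan'), 15.0, 14.0, 65.0],
})

full_summary = wine_reviews.describe(include='all')
print(full_summary)
```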

# Notice how the fields in `wine_reviews` vary in count?
>- A common occurrence in datasets is missing (aka null) values
>- We can use `pd.isnull()` to see all the null values for a particular field
>- We can use `pd.notnull()` to see only non-missing values for a particular field

#### Q: What are all the wines with missing country values?
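
One way to answer this is to use the True/False Series returned by `pd.isnull()` as a row filter. A sketch on made-up rows (the data is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['Italy', None, 'US', 'US'],
    'points': [87, 87, 87, 90],
})

# Rows where country is missing
missing_country = wine_reviews[pd.isnull(wine_reviews.country)]
print(missing_country)

# Rows where country is present
has_country = wine_reviews[pd.notnull(wine_reviews.country)]
print(has_country)
```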

## Now, suppose we want to replace a missing value with `Unknown`
>- We can use a pandas function called `fillna()` and pass the value "Unknown" to it

#### Replace null values for `region_2` with 'Unknown'

### To store the non-null values in a DataFrame...
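
A sketch of the `fillna()` step on made-up rows (the data is hypothetical). `fillna()` returns a new Series and leaves the original untouched, so one reading of "storing" the result is to assign it back to the column; that interpretation is an assumption.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'country': ['US', 'US', 'Italy'],
    'region_2': ['Willamette Valley', None, None],
})

# fillna() returns a new Series with nulls replaced
filled = wine_reviews.region_2.fillna('Unknown')

# The DataFrame itself still has its nulls at this point
nulls_before = wine_reviews.region_2.isnull().sum()

# Assign the result back to keep the change
wine_reviews['region_2'] = filled
print(wine_reviews)
```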

# Using `replace()` to replace specific values
>- Suppose a taster in the dataset gets a new Twitter handle
>>- We can use `replace()` to update this data

#### Task: Kerin O'Keefe is changing her Twitter handle from `@kerinokeefe` to `@kerino`
>- Use pandas `replace()` to make the change in our DataFrame
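
A minimal sketch of the task on made-up rows (the second taster and handle are hypothetical). Like `fillna()`, `replace()` returns a new Series, so the result is assigned back to the column to keep the change.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only
wine_reviews = pd.DataFrame({
    'taster_name': ["Kerin O'Keefe", 'Taster B', "Kerin O'Keefe"],
    'taster_twitter_handle': ['@kerinokeefe', '@tasterb', '@kerinokeefe'],
})

# Swap every occurrence of the old handle for the new one
wine_reviews['taster_twitter_handle'] = (
    wine_reviews.taster_twitter_handle.replace('@kerinokeefe', '@kerino')
)
print(wine_reviews)
```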