# Data Wrangling

# Imports

In [1]:
import pandas as pd

# Objectives

- Data Collection
    - Goal: Organize your data to streamline the next steps of your capstone
        - Data loading
        - Data joining
        - Hint: Data Collection will require the use of the pandas library, and functions like read_csv(), depending on the type of data you want to read in!
        - Hint: when adding one dataset to another, make sure you use the right function: you might want to merge, join, or concatenate.
- Data Organization
    - Goal: Create a file structure and add your work to the GitHub repository you’ve created for this project.
        - File structure
        - GitHub
        - Hint: the glob library could come in handy here…
        - Remind yourself of why GitHub is useful. What are the main motivations for making a GitHub repository?
- Data Definition
    - Goal: Gain an understanding of your data features to inform the next steps of your project.
        - Column names
        - Data types
        - Description of the columns
        - Counts and percentages of unique values
        - Ranges of values
        - Hint: here are some useful questions to ask yourself during this process:
            - Do your column names correspond to what those columns store?
            - Check the data types of your columns. Are they sensible?
            - Calculate summary statistics for each of your columns, such as mean, median, mode, standard deviation, range, and number of unique values. What does this tell you about your data? What do you now need to investigate?
- Data Cleaning
    - Goal: Clean up the data in order to prepare it for the next steps of your project.
        - NA or missing values
        - Duplicates
        - Hint: don’t forget about the following awesome pandas and NumPy tools for data cleaning, which make life a whole lot easier:
            - loc[] - select rows and columns by label or boolean mask
            - iloc[] - select rows and columns by integer position
            - apply() - execute a function along an axis of a DataFrame
            - drop() - drop rows or columns from a DataFrame
            - is_unique - a Series attribute (not a method, so no parentheses) that tells you whether a column could serve as a unique identifier
            - Series string methods, such as str.contains(), which checks whether a substring or regex pattern occurs in each string of a Series, and str.extract(), which extracts the capture groups of a regex (regular expression) pattern
            - NumPy functions like np.where(), to clean columns. Recall that it has the structure: np.where(condition, value_if_true, value_if_false)
            - DataFrame methods to check for null values, such as df.isnull().values.any()
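
The Data Definition and Data Cleaning hints above can be tried out on a small table before touching the real data. The sketch below is only an illustration: the DataFrame `df` and all of its values are invented, and only the column names are borrowed from the stroke dataset.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; only the column names mimic the stroke dataset.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "gender": ["Male", "Female", "Female", "Other"],
    "age": [67.0, 61.0, 61.0, 8.0],
    "bmi": [36.6, np.nan, np.nan, 18.0],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked", "Unknown"],
})

# Data definition: types, summary statistics, unique counts and percentages
column_types = df.dtypes
summary = df.describe()
unique_counts = df.nunique()
gender_pct = df["gender"].value_counts(normalize=True) * 100

# Missing values: is anything missing at all, and how many per column?
has_missing = df.isnull().values.any()
missing_per_column = df.isnull().sum()

# Duplicates: id is not a unique identifier until the repeated row is dropped
id_unique_before = df["id"].is_unique
df = df.drop_duplicates().reset_index(drop=True)
id_unique_after = df["id"].is_unique

# loc selects by label or boolean mask, iloc by integer position
adults = df.loc[df["age"] >= 18, ["id", "age"]]
first_row = df.iloc[0]

# str.contains returns a boolean Series; str.extract returns the capture group
smoked = df["smoking_status"].str.contains("smoked")
first_word = df["smoking_status"].str.extract(r"^(\w+)", expand=False)

# np.where(condition, value_if_true, value_if_false)
df["age_group"] = np.where(df["age"] >= 18, "adult", "child")

# apply runs a function on each value; drop removes columns (or rows)
df["gender"] = df["gender"].apply(str.lower)
df = df.drop(columns=["smoking_status"])
```

Whether to drop, fill, or keep missing values depends on the project, so the sketch only detects them and leaves that decision open.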

# Data Collection
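
This notebook loads a single CSV, but the objectives also mention glob and the choice between merge, join, and concatenate. As a rough sketch of how those fit together, the example below writes two made-up CSV parts to a temporary folder, finds them with glob, stacks them with `pd.concat`, and attaches extra columns with `merge`. The file names, the `labs` table, and every value are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two small CSV parts so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "age": [67.0, 61.0]}).to_csv(
    os.path.join(tmp_dir, "part_1.csv"), index=False
)
pd.DataFrame({"id": [3], "age": [80.0]}).to_csv(
    os.path.join(tmp_dir, "part_2.csv"), index=False
)

# glob finds every file matching the pattern; sorted() makes the order stable
paths = sorted(glob.glob(os.path.join(tmp_dir, "part_*.csv")))

# concat stacks tables that share the same columns (adds rows)
patients = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

# merge matches rows on a key column (adds columns);
# how="left" keeps every patient even when no lab row matches
labs = pd.DataFrame({"id": [1, 2, 4], "bmi": [36.6, 28.1, 22.0]})
combined = patients.merge(labs, on="id", how="left")
```

`DataFrame.join` does the same kind of matching as `merge` but aligns on the index by default, so it is mostly a convenience when the key is already the index.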

## Loading the data

In [5]:
stroke_data = pd.read_csv('stroke prediction/healthcare-dataset-stroke-data.csv')

In [6]:
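# Note: without parentheses this returns the bound method, so the notebook
# displays the DataFrame itself; call stroke_data.info() for the column summary.
stroke_data.info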

<bound method DataFrame.info of          id  gender   age  hypertension  heart_disease ever_married  \
0      9046    Male  67.0             0              1          Yes   
1     51676  Female  61.0             0              0          Yes   
2     31112    Male  80.0             0              1          Yes   
3     60182  Female  49.0             0              0          Yes   
4      1665  Female  79.0             1              0          Yes   
...     ...     ...   ...           ...            ...          ...   
5105  18234  Female  80.0             1              0          Yes   
5106  44873  Female  81.0             0              0          Yes   
5107  19723  Female  35.0             0              0          Yes   
5108  37544    Male  51.0             0              0          Yes   
5109  44679  Female  44.0             0              0          Yes   

          work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0           Private          Urban 