## 1. Common data problems
#### Inconsistent Columns
#### Missing data
#### Outliers
#### Duplicate Rows
#### Untidy data
#### Column types


## 2. Visually inspect data

    df.columns -> column names
    df.shape   -> (rows, columns) of the dataframe
    df.info()  -> column dtypes, non-null counts and memory usage
    
### data types in pandas

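    object        -> strings or mixed Python objects
    int64         -> integers
    float64       -> floating-point numbers (also used when a numeric column holds NaN)
    datetime64    -> dates and times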
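
A minimal sketch of the inspection calls and dtypes listed above. The dataframe and its column names are hypothetical toy data, not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data: one column for each kind of dtype in the notes
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],        # text (shown as object in most pandas versions)
    "visits": [3, 1, 7],                 # int64
    "spend": [10.5, 3.0, 22.25],         # float64
    "joined": pd.to_datetime(["2020-01-05", "2021-06-30", "2022-11-11"]),  # datetime64
})

print(df.columns.tolist())   # column names
print(df.shape)              # (rows, columns)
print(df.dtypes)             # dtype of each column
df.info()                    # dtypes plus non-null counts and memory usage
```

A numeric column that shows up as `object` here is usually a sign that it contains stray text and needs converting.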

## 3. Exploratory data Analysis

    a. Frequency count

        df.<column name>.value_counts(dropna=False)

    b. Summary Statistics

        check for outliers
        df.describe()         -> by default only numeric columns are summarized
        
    c. Find Outliers and obvious errors
       Data visualisation
        
        Bar Plots : discrete data counts
        
        
        Histograms : continuous data counts
        
        
        Histograms are great ways of visualizing single variables. To visualize multiple variables, boxplots are useful, especially when one of the variables is categorical.
        
        Box Plots : visualize the summary statistics (min, max, quartiles)
            can be used to find outliers
            
            
        Scatter Plot : plot between two variables    
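
The three steps above (frequency count, summary statistics, outlier hunt) can be sketched without drawing anything. The data is a hypothetical toy frame with one planted bad value, and the numeric outlier check uses the 1.5 x IQR convention that matplotlib's box plot whiskers follow by default. That rule is an assumption of this sketch, not something the notes prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 95 is a planted suspicious value
df = pd.DataFrame({
    "region": ["north", "south", "north", np.nan, "east", "north", "south", "east"],
    "reading": [12, 14, 15, 15, 16, 17, 18, 95],
})

# a. frequency count, keeping missing values visible
counts = df["region"].value_counts(dropna=False)
print(counts)

# b. summary statistics (numeric columns only by default)
summary = df.describe()
print(summary)

# c. numeric version of what a box plot shows: flag points beyond 1.5 * IQR
q1 = df["reading"].quantile(0.25)
q3 = df["reading"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["reading"] < lower) | (df["reading"] > upper)]
print(outliers)

# df["reading"].plot(kind="box") would draw the box plot itself (needs matplotlib)
```

A flagged value is only a candidate: it still has to be checked against the source before it is treated as an error.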
        


## 4. Tidying data

    Each variable as a separate column.
    Each row as a separate observation.

### melting
    to reshape the data, analysis friendly

    pd.melt()
    airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])
    
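    id_vars             -> columns to keep as identifiers (not melted)
    value_vars          -> columns to melt (default: every column not in id_vars)
    var_name            -> name for the new column holding the old column names
    value_name          -> name for the new column holding the values
    
    pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')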
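
A small sketch of the melt call on a cut-down `airquality` frame. The two rows and their values are hypothetical toy data, used only to show the shape change.

```python
import pandas as pd

# Hypothetical two-row stand-in for the airquality data
airquality = pd.DataFrame({
    "Month": [5, 5],
    "Day": [1, 2],
    "Ozone": [41.0, 36.0],
    "Temp": [67, 72],
})

# Month and Day stay as identifiers; Ozone and Temp are stacked into rows
airquality_melt = pd.melt(
    frame=airquality,
    id_vars=["Month", "Day"],
    var_name="measurement",
    value_name="reading",
)
print(airquality_melt)
```

Two wide rows with two measured columns become four long rows, one per (day, measurement) pair.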
    
### pivoting the data : opposite of melting
    to reshape the data, report friendly
    
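    df.pivot_table(index='', columns='', values='', aggfunc='mean')
    
    index           -> column(s) that stay as row labels
    columns         -> column whose values become the new column names
    values          -> column whose values fill the new columns
    aggfunc         -> how to aggregate when several rows share the same index/columns pair (default: mean)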
    
    
    pivoting moves the index columns into the (multi-)index; reset_index() turns them back into regular columns
    
    df = df.reset_index()
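
A sketch of pivoting long data back to wide form, with duplicate entries so that `aggfunc` has something to do. The frame `long_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data; (5, 1, Ozone) and (5, 2, Temp) appear twice
long_df = pd.DataFrame({
    "Month": [5, 5, 5, 5, 5, 5],
    "Day": [1, 1, 1, 2, 2, 2],
    "measurement": ["Ozone", "Ozone", "Temp", "Ozone", "Temp", "Temp"],
    "reading": [40.0, 42.0, 67.0, 36.0, 72.0, 74.0],
})

wide = long_df.pivot_table(
    index=["Month", "Day"],     # stay as row labels
    columns="measurement",      # values become column names
    values="reading",           # values fill the new columns
    aggfunc="mean",             # duplicates are averaged
)

# Month and Day are now a MultiIndex; turn them back into columns
wide = wide.reset_index()
print(wide)
```

Plain `df.pivot()` has no `aggfunc` and raises an error on duplicate index/columns pairs, which is why the notes use `pivot_table`.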
    
### create new columns from existing column

### split and get 

    ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

    ebola_melt['type'] = ebola_melt.str_split.str.get(0)

    ebola_melt['country'] = ebola_melt.str_split.str.get(1)
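
The same split-and-get steps on a hypothetical two-row `ebola_melt`. The `counts` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the melted ebola data
ebola_melt = pd.DataFrame({
    "type_country": ["Cases_Guinea", "Deaths_Liberia"],
    "counts": [10.0, 3.0],
})

# Each cell becomes a list such as ['Cases', 'Guinea']
ebola_melt["str_split"] = ebola_melt["type_country"].str.split("_")

# .str.get(i) picks element i out of every list
ebola_melt["type"] = ebola_melt["str_split"].str.get(0)
ebola_melt["country"] = ebola_melt["str_split"].str.get(1)
print(ebola_melt[["type", "country", "counts"]])
```

`str.split('_', expand=True)` is an alternative that returns the pieces as separate columns in one step.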
    

## 5. Combining data

## concatenate dataframes

#### row wise : stacks the rows of the dataframes, default (axis=0)
    df = pd.concat([df1, df2])                                -> column names should match; unmatched columns are filled with NaN
    
    df = pd.concat([df1, df2], ignore_index=True)             -> resets the index labels to 0..n-1
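
A short sketch of what `ignore_index` changes, using two hypothetical frames with the same columns.

```python
import pandas as pd

# Hypothetical frames with matching column names
df1 = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
df2 = pd.DataFrame({"id": [3, 4], "score": [0.9, 0.1]})

# Default: each frame keeps its own index labels, so labels repeat
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())

# ignore_index=True: a fresh 0..n-1 index
stacked_reset = pd.concat([df1, df2], ignore_index=True)
print(stacked_reset.index.tolist())
```

Repeated labels matter because `stacked.loc[0]` then returns two rows instead of one.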
    
#### column wise : places dataframes side by side, axis=1 (rows are aligned on index labels)
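
A minimal column-wise sketch, assuming two hypothetical frames that share the same index labels.

```python
import pandas as pd

# Hypothetical frames describing the same three rows
left = pd.DataFrame({"id": [1, 2, 3]})
right = pd.DataFrame({"score": [0.5, 0.7, 0.9]})

# axis=1 glues the columns together, matching rows by index label
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

If the index labels differ, rows without a partner get NaN in the columns from the other frame.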


### Globbing

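    import glob
    import pandas as pd

    # collect every CSV file in the current directory
    csv_files = glob.glob('*.csv')

    # read each file into its own dataframe
    list_data = []
    for file in csv_files:
        data = pd.read_csv(file)
        list_data.append(data)

    # stack all dataframes row wise
    df = pd.concat(list_data)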
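
A self-contained variant of the globbing pattern that creates its own input, so it can be run anywhere. The temporary directory and the file names `part_1.csv` and `part_2.csv` are hypothetical. `sorted()` is added because `glob` does not promise any particular order.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two small hypothetical CSV files to read back
    pd.DataFrame({"id": [1, 2]}).to_csv(os.path.join(tmp_dir, "part_1.csv"), index=False)
    pd.DataFrame({"id": [3]}).to_csv(os.path.join(tmp_dir, "part_2.csv"), index=False)

    # Match every CSV in the directory; sort for a stable order
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

    # Read each file into its own dataframe
    frames = [pd.read_csv(path) for path in csv_files]

# Stack the pieces and renumber the rows
combined = pd.concat(frames, ignore_index=True)
print(combined)
```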

## merge dataframe 

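    similar to joins in SQL
    
    pd.merge(left=left_df, right=right_df, on=None, left_on='col name', right_on='col name')
    
    
    on                   -> key column name, when it is the same in both dataframes
    left_on, right_on    -> key column names, when they differ between the two dataframes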
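
A sketch of a one-to-many merge where the key column is named differently on each side. The `customers` and `orders` frames are hypothetical.

```python
import pandas as pd

# Hypothetical tables: the key is 'customer_id' on the left and 'cust' on the right
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"cust": [1, 1, 2], "amount": [20.0, 35.0, 12.5]})

# Inner join by default: only keys present on both sides are kept
merged = pd.merge(
    left=customers,
    right=orders,
    left_on="customer_id",
    right_on="cust",
)
print(merged)
```

Both key columns survive the merge when `left_on`/`right_on` are used. With a shared name and `on=`, only one copy is kept.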


## Data types

    df.dtypes
    
   #### converting data types
   
       df['col'] = df['col'].astype(str)
       
       df['col'] = df['col'].astype('category')                     -> converting repetitive (low-cardinality) data to category reduces memory usage
       
       df['col'] = pd.to_numeric(df['col'], errors='coerce')        -> errors='coerce' converts any non-numeric values to NaN
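
A sketch of the two conversions on a hypothetical `tips` frame in which one bill was entered as text.

```python
import pandas as pd

# Hypothetical data: 'missing' is a bad entry in an otherwise numeric column
tips = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "total_bill": ["16.99", "10.34", "missing", "23.68"],
})

# Few distinct values repeated many times: a good fit for category
tips["sex"] = tips["sex"].astype("category")

# 'missing' cannot be parsed, so errors='coerce' turns it into NaN
tips["total_bill"] = pd.to_numeric(tips["total_bill"], errors="coerce")

print(tips.dtypes)
print(tips)
```

Without `errors='coerce'`, `pd.to_numeric` raises an error on the first value it cannot parse.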
       
       
       
   #### String manipulation
   
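       re - Regular Expression
       
       123342             ->   \d*
       $123342            ->   \$\d*
       $123342.123        ->   \$\d*\.\d*
       $123342.12         ->   \$\d*\.\d{2}
       
       
       word character     ->   \w     (a single letter, digit or underscore)
       digit              ->   \d     (a single digit)
       
       compile the pattern
       use the compiled pattern to match values
       
       quantifiers *   -> zero or more of the previous token
                   +   -> one or more of the previous token
                   {2} -> exactly two of the previous token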
       
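       import re
       pattern = re.compile(r'\$\d*\.\d{2}')
       result = pattern.match('$17.89')
       bool(result)                                -> True when the start of the string matches the pattern
       
       re.findall(pattern, string)                 -> list of all matches of pattern in string
   
       pattern = r'^[A-Za-z\.\s]*$'
       mask = countries.str.contains(pattern)      -> True for values made up only of letters, dots and whitespace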
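
A runnable sketch of the three regex uses above: a compiled pattern with `match`, `re.findall`, and a pandas `str.contains` mask. The sample strings and the `countries` values are hypothetical.

```python
import re

import pandas as pd

# Dollar sign, any number of digits, a dot, exactly two digits
price_pattern = re.compile(r"\$\d*\.\d{2}")

print(bool(price_pattern.match("$17.89")))   # valid amount
print(bool(price_pattern.match("$17.8")))    # only one decimal digit
print(bool(price_pattern.match("17.89")))    # no dollar sign

# findall returns every non-overlapping match as a list of strings
numbers = re.findall(r"\d+", "the recipe calls for 10 strawberries and 1 banana")
print(numbers)

# Hypothetical country names; only letters, dots and whitespace are allowed
countries = pd.Series(["Congo, Dem. Rep.", "St. Lucia", "Bahamas, The", "Cote d'Ivoire"])
name_pattern = r"^[A-Za-z\.\s]*$"
mask = countries.str.contains(name_pattern)

# Invert the mask to list the values that break the rule
invalid = countries[~mask]
print(invalid)
```

Raw strings (`r'...'`) keep Python from interpreting the backslashes before the regex engine sees them. Note that `match` only anchors at the start of the string, so trailing characters are not rejected unless the pattern ends with `$`.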
        
## Apply  

#### Examples
    tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

    tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])
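
Both `apply` variants on a hypothetical three-row `tips` frame, followed by a numeric conversion that the notes do not include but that is usually the reason for stripping the `$`.

```python
import re

import pandas as pd

# Hypothetical data: amounts stored as text with a dollar sign
tips = pd.DataFrame({"total_dollar": ["$16.99", "$10.34", "$21.01"]})

# Variant 1: plain string replacement
tips["total_dollar_replace"] = tips["total_dollar"].apply(lambda x: x.replace("$", ""))

# Variant 2: pull out the first digits.digits run with a regex
tips["total_dollar_re"] = tips["total_dollar"].apply(
    lambda x: re.findall(r"\d+\.\d+", x)[0]
)

# Added step (assumption): the cleaned text is still a string, so convert it
tips["total_dollar_num"] = tips["total_dollar_re"].astype(float)
print(tips)
```

The regex variant raises an `IndexError` on a value with no match, because `[0]` is applied to an empty list. The replace variant passes such values through unchanged.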


### Duplicate data

    df = df.drop_duplicates()        -> returns a new dataframe without exact duplicate rows; the first occurrence is kept
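
A minimal sketch on a hypothetical frame with one repeated row.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [1, 2, 1]})

# The original is left untouched; the result must be assigned
deduped = df.drop_duplicates()
print(deduped)
```

The surviving rows keep their original index labels, so a `reset_index(drop=True)` is often the next step.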
    
### Missing data
    leave as is
    replace (fill in)
    drop
    
    df.isnull().sum(axis=0)          -> number of missing values in each column
    
   #### drop missing values
    
    df.dropna()                      -> drops every row that contains at least one missing value
    
   #### fill missing values
   
    df.fillna('value')               -> replaces every missing value with the given value
    
    oz_mean = airquality.Ozone.mean()
    airquality['Ozone'] = airquality.Ozone.fillna(oz_mean) 
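
The count, drop and fill options side by side on a hypothetical four-row `airquality` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing Ozone readings
airquality = pd.DataFrame({
    "Ozone": [40.0, np.nan, 20.0, np.nan],
    "Temp": [67, 72, 74, 62],
})

# Count missing values per column
missing_per_column = airquality.isnull().sum(axis=0)
print(missing_per_column)

# Option: drop every row that has a missing value
dropped = airquality.dropna()
print(len(dropped))

# Option: fill with the column mean (mean() skips NaN)
oz_mean = airquality["Ozone"].mean()
airquality["Ozone"] = airquality["Ozone"].fillna(oz_mean)
print(airquality)
```

Dropping halves this dataset, while filling keeps every row at the cost of inventing two readings. Which trade-off is acceptable depends on the analysis.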

   #### assert
   
    assert pd.notnull(ebola).all().all()
    assert (ebola >= 0).all().all()
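
A sketch of how the two checks behave, using a hypothetical `ebola` frame. The first `.all()` reduces each column to a single boolean and the second reduces those to one value for the whole frame.

```python
import numpy as np
import pandas as pd

# Hypothetical clean data: no missing values, nothing negative
ebola = pd.DataFrame({"cases": [10.0, 0.0, 3.0], "deaths": [2.0, 0.0, 1.0]})

# Both checks pass silently on clean data
assert pd.notnull(ebola).all().all()
assert (ebola >= 0).all().all()

# Plant a missing value and watch the first check fail
ebola.loc[1, "deaths"] = np.nan
try:
    assert pd.notnull(ebola).all().all()
    check_failed = False
except AssertionError:
    check_failed = True

print(check_failed)
```

Placed at the end of a cleaning script, such asserts stop the run as soon as a later data refresh reintroduces a problem.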

### Quick inspection checklist

    df.head()
    df.info()
    df.describe()
    df.columns
    df.shape