# Cleaning Data:

Cleaning the data involves the following:

1. Identify missing data

2. Treat (delete or impute) missing values

In pandas, missing data is generally represented by NaN (Not a Number) for numeric data, None for Python objects, or NaT (Not a Time) for datetimes.
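As a quick check of which markers pandas treats as missing, here is a minimal sketch. The variable names are illustrative only and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# None is converted to NaN when it sits in a float Series
values = pd.Series([1.0, np.nan, None])
# NaT is the missing-value marker for datetimes
dates = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])

# NaN never compares equal to itself, so == cannot be used to find it
nan_equals_itself = np.nan == np.nan

missing_values = values.isna().tolist()
missing_dates = dates.isna().tolist()

# An empty string is an ordinary value, not a missing one
empty_string_is_missing = pd.isna("")
```

Because NaN does not equal itself, missing values should be detected with `isna()` or `notna()` rather than with `==`.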

### 4 Main Methods to identify and treat missing data:

#### isna():
Detects missing values.
Returns a boolean object of the same size, indicating whether each value is NA. NA values, such as None or numpy.nan, are mapped to True. Everything else is mapped to False. Note that an empty string ('') is not considered NA.

#### Syntax:
DataFrame.isna()

#### Returns:
DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is an NA value

In [None]:
import numpy as np
import pandas as pd

# Sample data with one missing value in each of 'age', 'born', and 'toy'
df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'),
                            pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})

df
df.isna()  # Show which entries in a DataFrame are NA.
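The boolean mask from isna() is often summarized rather than read cell by cell. The following sketch, which assumes the same sample data as above, counts missing values per column. The names `missing_per_column`, `missing_share`, and `total_missing` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'),
                            pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isna().sum()

# The mean of the mask is the fraction of missing values per column
missing_share = df.isna().mean()

# Summing again gives the total number of missing cells
total_missing = int(df.isna().sum().sum())
```

The 'name' column reports zero missing values even though it contains an empty string, because '' is a regular value.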


#### notna(): 
The opposite of isna(): detects non-missing values.
Returns a boolean object of the same size, indicating whether each value is not NA. Non-missing values are mapped to True. NA values, such as None or numpy.nan, are mapped to False.

#### Syntax:
DataFrame.notna()

#### Returns:
DataFrame: Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value


In [None]:
df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'),
                            pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})

df
df.notna()  # Show which entries in a DataFrame are not NA.
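A common use of notna() is as a row filter. The sketch below, assuming the same sample data, keeps rows based on the mask. `has_toy` and `complete_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'),
                            pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})

# Keep only the rows where 'toy' is present
has_toy = df[df['toy'].notna()]

# Keep only the rows where every column is present
complete_rows = df[df.notna().all(axis=1)]
```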

#### dropna(): 
Removes the rows or columns that contain missing values and returns the rest of the DataFrame.

#### Syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

#### Parameters:
axis :
{0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing values.

how :
{‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
- ‘any’ : If any NA values are present, drop that row or column.
- ‘all’ : If all values are NA, drop that row or column.

thresh :
int, optional
Minimum number of non-NA values a row or column must have to be kept. Cannot be combined with how.

subset :
array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

inplace :
bool, default False
If True, do operation inplace and return None.

#### Returns:
DataFrame: DataFrame with NA entries dropped from it, or None if inplace=True


In [None]:
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})
df
df.dropna()  # Drop the rows where at least one element is missing.

df.dropna(axis='columns')  # Drop the columns where at least one element is missing.

df.dropna(how='all')  # Drop the rows where all elements are missing.

df.dropna(thresh=2)  # Keep only the rows with at least 2 non-NA values.

df.dropna(subset=['name', 'born'])  # Only look for missing values in these columns.

df.dropna(inplace=True)  # Modify df itself so that it keeps only the complete rows.
df
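To compare how the dropna() options behave, one possible sketch is to count the rows each call keeps on the same sample data. The `options` and `rows_kept` names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

# Each call returns a new DataFrame; df itself is not modified
options = {
    "how='any'": df.dropna(),
    "how='all'": df.dropna(how='all'),
    "thresh=2": df.dropna(thresh=2),
    "subset=['toy']": df.dropna(subset=['toy']),
}

# Number of rows that survive each option
rows_kept = {label: len(result) for label, result in options.items()}

# Only 'name' has no missing values, so it is the only column kept
columns_kept = df.dropna(axis='columns').columns.tolist()
```

Checking the surviving row count before committing to `inplace=True` is a cheap way to avoid discarding more data than intended.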

#### fillna():
Fills NA/NaN values with a given value or using the specified method.

#### Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

#### Parameters:
value :
scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

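method :
{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in a reindexed Series:
- pad / ffill: propagate the last valid observation forward to the next valid one.
- backfill / bfill: use the next valid observation to fill the gap.

Note: pandas 2.1 and later deprecate this parameter in favor of DataFrame.ffill() and DataFrame.bfill().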

axis :
{0 or ‘index’, 1 or ‘columns’}
Axis along which to fill missing values.

inplace :
boolean, default False
If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit :
int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast :
dict, default None
A dict of item->dtype of what to downcast if possible, or the string ‘infer’, which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

#### Returns:
DataFrame: Object with missing values filled, or None if inplace=True
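The forward and backward fill behavior described under method and limit can be sketched with the dedicated ffill() and bfill() methods, which recent pandas versions recommend over the method argument. The Series `s` below is made-up illustration data.

```python
import numpy as np
import pandas as pd

# A gap of three consecutive missing values between two valid observations
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Propagate the last valid observation forward
forward = s.ffill()

# Use the next valid observation to fill the gap
backward = s.bfill()

# With limit=1 only the first NaN of the gap is filled, the rest stay missing
forward_limited = s.ffill(limit=1)
```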

In [None]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
df.fillna(0)  # Replace all NaN elements with 0s.
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)  # Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.
df.fillna(value=values, limit=1)  # Only replace the first NaN element in each column.
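Imputation often uses a statistic computed from the data itself instead of a fixed constant. As a minimal sketch on the same sample frame, the column mean can be passed to fillna() as a Series. Mean imputation is one choice among many, and the name `filled` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

# df.mean() skips NaN values and returns one mean per column
column_means = df.mean()

# Each column's NaN values are filled with that column's mean
filled = df.fillna(column_means)

# Column 'C' has no valid values, so its mean is NaN and nothing is filled
still_missing = filled.isna().sum()
```

A column that is entirely missing, like 'C' here, cannot be imputed from its own values. It needs a constant, a value from another source, or removal with dropna(axis='columns', how='all').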