### Loading dataframes

In [None]:
import pandas as pd

### Check for duplicates

In [None]:
df = pd.DataFrame({'a':[1,1,1,2,2,3,4,5],
                  'b':[10,10,11,20,20,30,40,50]})

In [None]:
df

In [None]:
df.duplicated()

In [None]:
df.duplicated(subset=['a'])

In [None]:
df.duplicated().sum()

### Drop duplicated rows

In [None]:
df_no_duplicates = df.drop_duplicates()

In [None]:
len(df), len(df_no_duplicates)

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
len(df)

### Check for missing values

In [None]:
df = pd.DataFrame({'a':[1,2,3,4],
                  'b':[10,None,30,40],
                  'c':[None,None,None,400]})

In [None]:
df.isnull()

In [None]:
df.isnull().any(axis=0)

In [None]:
df.isnull().any(axis=1)

In [None]:
df.isnull().sum()

### Drop rows with missing values

In [None]:
df

In [None]:
df_only_full_rows = df.dropna()
df_only_full_rows

In [None]:
df_rows_where_b_is_not_missing = df.dropna(subset=['b'])
df_rows_where_b_is_not_missing

### Replace missing values

In [None]:
df

In [None]:
mean_b = df['b'].mean()
df_missing_b_replaced_with_mean = df.fillna(value={'b':mean_b})
df_missing_b_replaced_with_mean

In [None]:
df_missing_b_replaced_with_mean_missing_c_replaced_with_zero = df.fillna(value={'b':mean_b,
                                                                                'c':0})
df_missing_b_replaced_with_mean_missing_c_replaced_with_zero

In [None]:
df.fillna('Unknown', inplace=True)
df

### Describing numeric data

In [None]:
df = pd.DataFrame({'a':[1,1,1,2,2,3,4,5],
                  'b':[10,10,11,20,None,None,40,50],
                  'c':['apple','apple','plum','pear','plum','apple','apple','apple']})

In [None]:
df

In [None]:
df.describe()

In [None]:
df.max()

In [None]:
df['a'].max()

### Describing non-numeric data

In [None]:
df

In [None]:
df['c'].unique()

In [None]:
df['c'].nunique()

In [None]:
df['c'].value_counts()

## Exercise

### 1 - exercise
Load the datasets into pandas dataframes called trip, weather and station. (Don't forget to import the pandas library first!) <br>
Create three variables called trip_duplicates_num, weather_duplicates_num and station_duplicates_num that contain the number of duplicated rows in each dataframe.
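
As a hint, here is a minimal sketch of loading CSV data and counting duplicated rows. The CSV text, the `toy` dataframe and its column names are made up for illustration; the real exercise files are not assumed here.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "id,city\n1,Oslo\n2,Bergen\n2,Bergen\n3,Oslo\n"

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# duplicated() marks each repeat of an earlier row as True; sum() counts them.
toy_duplicates_num = int(toy.duplicated().sum())
print(len(toy), toy_duplicates_num)
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` object.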

In [None]:
### Your code here


### 1 - check yourself

In [None]:
print('Length of trip dataframe should be 144115 with 100 duplicated rows.\n\
The length of your dataframe is {} and you counted {} duplicated rows\n'.format(len(trip), trip_duplicates_num))
print('Length of weather dataframe should be 928 with 8 duplicated rows.\n\
The length of your dataframe is {} and you counted {} duplicated rows\n'.format(len(weather), weather_duplicates_num))
print('Length of station dataframe should be 69 with 0 duplicated rows.\n\
The length of your dataframe is {} and you counted {} duplicated rows\n'.format(len(station), station_duplicates_num))

### 2 - exercise
For all 3 dataframes, delete the duplicated rows in place (without creating a new dataframe). <br>

In [None]:
### Your code here


### 2 - check yourself

In [None]:
print('Length of trip dataframe should be 144015.\n\
The length of your dataframe is {}\n'.format(len(trip)))
print('Length of weather dataframe should be 920.\n\
The length of your dataframe is {}\n'.format(len(weather)))
print('Length of station dataframe should be 69.\n\
The length of your dataframe is {}\n'.format(len(station)))

### 3 - exercise
For all dataframes check if there are columns with missing values. <br>
Create 3 lists called trip_columns_with_missing_data, weather_columns_with_missing_data and station_columns_with_missing_data that contain the names of the columns with missing values in each dataframe. <br>
You can populate these lists by hand, or as an advanced task you can use pandas methods.
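
For the advanced route, one possible approach is sketched below on a made-up dataframe (`toy` and its columns are hypothetical, not part of the exercise data). It uses the boolean result of `isnull().any()` to select column names.

```python
import pandas as pd

# Hypothetical toy data: 'x' is complete, 'y' and 'z' have gaps.
toy = pd.DataFrame({'x': [1, 2, 3],
                    'y': [1.0, None, 3.0],
                    'z': ['a', None, None]})

# Boolean Series indexed by column name: True where the column has a missing value.
has_missing = toy.isnull().any()

# Keep only the column names flagged True and convert them to a plain list.
toy_columns_with_missing_data = toy.columns[has_missing].tolist()
print(toy_columns_with_missing_data)
```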

In [None]:
### Your code here


### 3 - check yourself

In [None]:
print('Columns with missing values in the trip dataframe are:\n \
Subscription Type\n\
You have found:\n {}\n'.format(','.join(trip_columns_with_missing_data)))

print('Columns with missing values in the weather dataframe are:\n \
Max_Temperature_F,Mean_Temperature_F,Min_TemperatureF,Max_Gust_Speed_MPH,Events\n\
You have found:\n {}\n'.format(','.join(weather_columns_with_missing_data)))

print('Columns with missing values in the station dataframe are:\n \
\n\
You have found:\n {}\n'.format(','.join(station_columns_with_missing_data)))

### 4 - exercise
How many values are missing in each column with missing values in the dataframes? <br>
Display the answer in any format you would like. As an advanced task, try to display only the names of the columns that have missing values in them.
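
A small sketch of the advanced variant, again on a hypothetical `toy` dataframe: count the missing values per column, then keep only the columns where the count is above zero.

```python
import pandas as pd

# Hypothetical toy data: 'x' is complete, 'y' misses one value, 'z' misses two.
toy = pd.DataFrame({'x': [1, 2, 3],
                    'y': [1.0, None, 3.0],
                    'z': ['a', None, None]})

# Missing values per column.
missing_counts = toy.isnull().sum()

# Filter the Series with a boolean mask so complete columns are not shown.
missing_counts_only = missing_counts[missing_counts > 0]
print(missing_counts_only)
```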

In [None]:
### Your code here


### 4 - check yourself
The number of missing values in Max_Temperature_F column is 3 <br>
The number of missing values in Mean_Temperature_F column is 3<br>
The number of missing values in Min_TemperatureF column is 3<br>
The number of missing values in Max_Gust_Speed_MPH column is 138<br>
The number of missing values in Events column is 782<br>
The number of missing values in Subscription Type column is 10<br>

### 5 - exercise
Before deciding how to deal with the missing values, let's get more familiar with the data! <br>
 - How many non-missing values are in the columns with missing data? As an advanced task, try to display the counts for only these columns, not the others.
 - Display the mean of each numeric column in the weather dataframe. Which 2 columns have the lowest mean? As an advanced task, try to display the means in descending order!
 - And what about the Events column in the weather dataframe? What are the unique values and how many times do they occur?
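
The three questions above map onto `count()`, `mean()` and `value_counts()`. The sketch below shows them on a made-up dataframe; `toy` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 'id' is complete, the other columns have gaps.
toy = pd.DataFrame({'id': [1, 2, 3, 4],
                    'temp': [10.0, None, 30.0, 20.0],
                    'wind': [1.0, 2.0, 3.0, None],
                    'event': ['Rain', None, 'Fog', 'Rain']})

# count() returns the number of non-missing values per column;
# the boolean mask keeps only the columns that have missing values.
non_missing = toy.count()[toy.isnull().any()]

# Mean of each numeric column, largest first.
means = toy.mean(numeric_only=True).sort_values(ascending=False)

# Occurrences of each distinct value; missing values are skipped by default.
event_counts = toy['event'].value_counts()

print(non_missing)
print(means)
print(event_counts)
```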

In [None]:
### Your code here


### 5 - check yourself
The number of non-missing values in each column:
- Max_Temperature_F: 917
- Mean_Temperature_F: 917
- Min_TemperatureF: 917
- Max_Gust_Speed_MPH: 782
- Events: 138
- Subscription Type: 144005

The columns with the lowest mean values are Cloud_clover and Mean_Wind_Speed_MPH <br><br>
In the Events column there are 101 rows with Rain, 34 rows with Fog, 2 rows with rain and 1 row with Fog-Rain

### 6 - exercise
So let's decide what we will do with the missing data!<br>
- In the Temperature and the Gust Speed columns, only a few values are missing, so let's fill those cells with the mean of the column.
- In the Events column, a missing value means that there was no rain or fog that day. So let's fill those cells with the string 'no_event'.
- In the Subscription Type column, only a few values are missing and we can't guess the original values. Let's delete those rows from the dataframe.

Create a new weather and trip dataframe called weather_filled and trip_filled where these solutions are applied!
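
The three strategies above can be sketched on made-up data as follows. The dataframes `toy_weather` and `toy_trip` and their columns are hypothetical; this shows one way to combine `fillna` and `dropna`, not the only one.

```python
import pandas as pd

# Hypothetical toy data with gaps in every column except 'trip_id'.
toy_weather = pd.DataFrame({'temp': [10.0, None, 30.0],
                            'event': ['Rain', None, None]})
toy_trip = pd.DataFrame({'trip_id': [1, 2, 3],
                         'kind': ['member', None, 'casual']})

# A dict passed to fillna sets a separate replacement value per column.
# Without inplace=True a new dataframe is returned and the original is untouched.
toy_weather_filled = toy_weather.fillna(value={'temp': toy_weather['temp'].mean(),
                                               'event': 'no_event'})

# Drop only the rows where 'kind' is missing.
toy_trip_filled = toy_trip.dropna(subset=['kind'])

print(toy_weather_filled)
print(toy_trip_filled)
```

Because the mean is computed from the original column, the filled value does not depend on the order in which the columns are processed.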

In [None]:
### Your code here


### 6 - check yourself

In [None]:
print('In the weather_filled dataframe the number of missing values should be 0\n \
and in your dataframe the number is {}\n'.format(weather_filled.isnull().sum().sum()))
print('The number of rows where the replacing value is not correct in the weather_filled dataframe: \n \
in column Max_Temperature_F is {} \n \
in column Mean_Temperature_F is {} \n \
in column Min_TemperatureF is {} \n \
in column Max_Gust_Speed_MPH is {} \n \
in column Events is {} \n '.format((weather_filled[weather.Max_Temperature_F.isnull()]['Max_Temperature_F'] != weather.Max_Temperature_F.mean()).sum(),
                                             (weather_filled[weather.Mean_Temperature_F.isnull()]['Mean_Temperature_F'] != weather.Mean_Temperature_F.mean()).sum(),
                                             (weather_filled[weather.Min_TemperatureF.isnull()]['Min_TemperatureF'] != weather.Min_TemperatureF.mean()).sum(),
                                             (weather_filled[weather.Max_Gust_Speed_MPH.isnull()]['Max_Gust_Speed_MPH'] != weather.Max_Gust_Speed_MPH.mean()).sum(),
                                             (weather_filled[weather.Events.isnull()]['Events'] != 'no_event').sum()))
print('The length of the trip_filled dataframe should be 144005\n \
and in your dataframe the length is {}\n'.format(len(trip_filled)))


### 7 - exercise
Save these new dataframes into CSV files called weather_filled.csv and trip_filled.csv!
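
As a hint, the sketch below writes a made-up dataframe to a CSV file and reads it back. The `toy` dataframe and the file name `toy.csv` are hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    # index=False leaves the row index out of the file.
    toy.to_csv(path, index=False)
    saved = os.path.exists(path)
    # Read the file back to confirm the content survived the round trip.
    round_trip = pd.read_csv(path)

print(saved)
print(round_trip)
```

Whether to keep the index is a design choice: without `index=False`, reading the file back produces an extra unnamed column.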

In [None]:
### Your code here


### 7 - check yourself

In [None]:
import os  # needed for os.listdir()

if 'trip_filled.csv' in os.listdir():
    print('trip_filled.csv was successfully saved')
else:
    print('trip_filled.csv was NOT successfully saved')
if 'weather_filled.csv' in os.listdir():
    print('weather_filled.csv was successfully saved')
else:
    print('weather_filled.csv was NOT successfully saved')