<a href="https://colab.research.google.com/github/shammud/python/blob/main/Copy_of_Air_quality_mini_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clean and wrangle air quality data

The following data file contains data collected at a roadside monitoring station.  You can see the data in a spreadsheet here: https://docs.google.com/spreadsheets/d/1XpAvrpuyMsKDO76EZ3kxuddBOu7cZX1Od4uEts14zco/edit?usp=sharing

The data contains:
* a heading line (Chatham Roadside) which needs to be skipped
* dates which are sometimes left- and sometimes right-justified indicating that they are not formatted as dates, rather they are text (so need to be converted to dates)
* times which are not all in the same format
* Nitrogen Dioxide levels which are, again, text and sometimes contain nodata
* Status which is always the same





### Project - clean, sort and wrangle the data

Read the dataset into a dataframe, skipping the first row   
Convert dates to date format  
Remove rows with nodata in the Nitrogen dioxide column  
Convert the Nitrogen dioxide levels values to float type  
Sort by Nitrogen dioxide level  
Create a new column for 'Weekdays' (use df['Date'].dt.weekday)  
Rename the column Nitrogen dioxide level to NO2 Level (V ug/m2)  
Remove the Status column  

The dataset can be viewed here:  https://drive.google.com/file/d/1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ/view?usp=sharing  and the data accessed here: https://drive.google.com/uc?id=1SOe9b4VJ1FCtDVgZ2T8d00-jTw2Kux1i  This is a .csv file  

**NOTE:** Some useful references are included at the bottom of this spreadsheet.

Use the code cell below to work your code.

In [174]:
import pandas as pd
data=pd.read_csv("https://drive.google.com/uc?id=1SOe9b4VJ1FCtDVgZ2T8d00-jTw2Kux1i",skiprows=1)
def get_data():
 c=data[data['Nitrogen dioxide'] != 'nodata']
 a=c.sort_values(by=["Nitrogen dioxide"])
 a['Date']= pd.to_datetime(a['Date'])
 a['Nitrogen dioxide'] =pd.to_numeric(a['Nitrogen dioxide'], errors='coerce')
 a['Weekdays']=a['Date'].dt.strftime("%A")
 a.rename(columns={"Nitrogen dioxide": "NO2"},inplace=True)
 a.drop("Status",axis=1,inplace=True)
#  print(len(data),len(a))
#  a.info()
 return a
get_data()

Unnamed: 0,Date,Time,NO2,Weekdays
8668,2020-12-27,5:00,0.42410,Sunday
5712,2020-08-26,1:00,0.58689,Wednesday
4489,2020-06-07,2:00,0.58930,Sunday
5714,2020-08-26,3:00,0.59123,Wednesday
8669,2020-12-27,6:00,0.65300,Sunday
...,...,...,...,...
674,2020-01-29,3:00,9.99194,Wednesday
2570,2020-04-17,3:00,9.99557,Friday
8267,2020-10-12,12:00,9.99864,Monday
6499,2020-09-27,20:00,9.99883,Sunday


### Expand the dataset and show summary statistics for larger dataset
---

There is a second data set here covering the year 2021:  https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ  

Concatenate the two datasets to expand it to 2020 and 2021.  

Before you can concatenate the datasets you will need to clean and wrangle the second dataset in the same way as the first.  Use the code cell below.  Give the second dataset a different name. 

After the datasets have been concatenated, group the data by Weekdays and show summary statistics by day of the week.

In [200]:
import pandas as pd
data1=pd.read_csv("https://drive.google.com/uc?id=1aYmBf9il2dWA-EROvbYRCZ1rU2t7JwvJ",skiprows=1)
def get_data2():
 a=get_data()
 # clean and wrangle the second dataset
 b=data1[data1['Nitrogen dioxide'] != 'nodata']
 a1=b.sort_values(by=["Nitrogen dioxide"])
 a1['Date']= pd.to_datetime(a1['Date'])
 a1['Nitrogen dioxide'] =pd.to_numeric(a1['Nitrogen dioxide'], errors='coerce')
 a1['Weekdays']=a1['Date'].dt.strftime("%A")
 a1.rename(columns={"Nitrogen dioxide": "NO2"},inplace=True)
 a1.drop("Status",axis=1,inplace=True)
 new=pd.concat([a,a1],axis=0)
 days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
 mean=new[["Weekdays","NO2"]].groupby("Weekdays").mean().reindex(days)
 std=new[["Weekdays","NO2"]].groupby("Weekdays").std().reindex(days)
 max=new[["Weekdays","NO2"]].groupby("Weekdays").max().reindex(days)
 min=new[["Weekdays","NO2"]].groupby("Weekdays").min().reindex(days)
 range=new[["Weekdays","NO2"]].groupby("Weekdays").max().reindex(days)-new[["Weekdays","NO2"]].groupby("Weekdays").min().reindex(days)
 sum=new[["Weekdays","NO2"]].groupby("Weekdays").sum().reindex(days)
 print("MEAN : ",mean)
 print("STD : ",std)
 print("MAX : ",max)
 print("MIN : ",min)
 print("RANGE : ",range)
 print("SUM : ",sum)
#  display(new.describe())
#  print(len(data1),len(a1))
#  print(len(new))
#  a1.info()
#  new.info()
get_data2()  

MEAN :                   NO2
Weekdays            
Monday     18.646198
Tuesday    19.386772
Wednesday  20.612752
Thursday   19.572436
Friday     19.447164
Saturday   17.217820
Sunday     15.259131
STD :                   NO2
Weekdays            
Monday     13.417775
Tuesday    13.746323
Wednesday  13.152223
Thursday   13.280281
Friday     12.383304
Saturday   12.096804
Sunday     10.460518
MAX :                   NO2
Weekdays            
Monday     113.06189
Tuesday     92.13063
Wednesday   73.40940
Thursday    76.46283
Friday      76.69458
Saturday    84.55297
Sunday      76.72297
MIN :                 NO2
Weekdays          
Monday     0.65360
Tuesday   -0.10519
Wednesday -0.77743
Thursday  -0.31174
Friday     0.03299
Saturday   1.20392
Sunday    -0.41740
RANGE :                   NO2
Weekdays            
Monday     112.40829
Tuesday     92.23582
Wednesday   74.18683
Thursday    76.77457
Friday      76.66159
Saturday    83.34905
Sunday      77.14037
SUM :                     NO2
Weekd

### Helpful references
---
Skipping rows when reading datasets:  
https://www.geeksforgeeks.org/how-to-skip-rows-while-reading-csv-file-using-pandas/  

Converting strings to dates:  
https://www.geeksforgeeks.org/convert-the-column-type-from-string-to-datetime-format-in-pandas-dataframe/

Dropping rows where data has a given value:  
https://www.datasciencemadesimple.com/drop-delete-rows-conditions-python-pandas/  
(see section Drop a row or observation by condition) 

Convert a column of strings to a column of floats:
https://datatofish.com/convert-string-to-float-dataframe/  

Create a new column from data converted in an existing column:  
https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/  

Rename a column:  
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html  

Remove a column by name:  
https://www.kite.com/python/answers/how-to-delete-columns-from-a-pandas-%60dataframe%60-by-column-name-in-python#:~:text=Use%20the%20del%20keyword%20to,the%20name%20column_name%20from%20DataFrame%20.
