# 2. Refine the Data
 
> "Data is messy"

We will be performing the following operation on our Onion price to refine it
- **Remove** e.g. remove redundant data from the data frame
- **Derive** e.g. State and City from the market field
- **Parse** e.g. extract date from year and month column

Other stuff you may need to do to refine are...
- **Missing** e.g. Check for missing or incomplete data
- **Quality** e.g. Check for duplicates, accuracy, unusual data
- **Convert** e.g. free text to coded value
- **Calculate** e.g. percentages, proportion
- **Merge** e.g. first and surname for full name
- **Aggregate** e.g. rollup by year, cluster by area
- **Filter** e.g. exclude based on location
- **Sample** e.g. extract a representative data
- **Summary** e.g. show summary stats like mean

In [2]:
# Import the one library we need, which is dplyr
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [16]:
# Read the csv file of Month Wise Market Arrival data that has been scraped.
df = read.csv('MonthWiseMarketArrivalsAll.csv')

In [17]:
head(df)

Unnamed: 0,Market,Month.Name,Year,Arrival..q.,Price.Minimum..Rs.q.,Price.Maximum..Rs.q.,Modal.Price..Rs.q.
1,ABOHAR(PB),January,2005,2350,404,493,446
2,ABOHAR(PB),January,2006,900,487,638,563
3,ABOHAR(PB),January,2010,790,1283,1592,1460
4,ABOHAR(PB),January,2011,245,3067,3750,3433
5,ABOHAR(PB),January,2012,1035,523,686,605
6,ABOHAR(PB),January,2013,675,1327,1900,1605


In [5]:
tail(df)

Unnamed: 0,market,month,year,quantity,priceMin,priceMax,priceMod
10223,YEOLA(MS),December,2011,131326,282,612,526
10224,YEOLA(MS),December,2012,207066,485,1327,1136
10225,YEOLA(MS),December,2013,215883,472,1427,1177
10226,YEOLA(MS),December,2014,201077,446,1654,1456
10227,YEOLA(MS),December,2015,223315,609,1446,1126
10228,,,Total,783438108,647(Avg),1213(Avg),984(Avg)


## Fix the column names

In [None]:
column_names <- c('market', 'month', 'year', 'quantity', 'priceMin', 'priceMax')

## Remove the redundant data

In [6]:
str(df)

'data.frame':	10228 obs. of  7 variables:
 $ market  : Factor w/ 121 levels "","ABOHAR(PB)",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ month   : Factor w/ 13 levels "","April","August",..: 6 6 6 6 6 6 6 6 5 5 ...
 $ year    : Factor w/ 22 levels "1996","1997",..: 10 11 15 16 17 18 19 20 10 11 ...
 $ quantity: int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ priceMin: Factor w/ 2027 levels "100","1000","1001",..: 1326 1464 278 1123 1511 319 28 300 1079 1205 ...
 $ priceMax: Factor w/ 2733 levels "1000","1001",..: 2135 2366 573 1760 2418 867 463 823 1723 1879 ...
 $ priceMod: Factor w/ 2425 levels "100","1000","1001",..: 1811 1987 471 1513 2031 614 258 622 1452 1625 ...


In [7]:
# Delete the last row from the dataframe
tail(df, n = 1)

Unnamed: 0,market,month,year,quantity,priceMin,priceMax,priceMod
10228,,,Total,783438108,647(Avg),1213(Avg),984(Avg)


In [9]:
# Delete the last row from the dataframe
df <- df %>%
      filter(year != "Total")

In [10]:
tail(df)

Unnamed: 0,market,month,year,quantity,priceMin,priceMax,priceMod
10222,YEOLA(MS),December,2010,57586,541,2713,1830
10223,YEOLA(MS),December,2011,131326,282,612,526
10224,YEOLA(MS),December,2012,207066,485,1327,1136
10225,YEOLA(MS),December,2013,215883,472,1427,1177
10226,YEOLA(MS),December,2014,201077,446,1654,1456
10227,YEOLA(MS),December,2015,223315,609,1446,1126


In [11]:
str(df)

'data.frame':	10227 obs. of  7 variables:
 $ market  : Factor w/ 121 levels "","ABOHAR(PB)",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ month   : Factor w/ 13 levels "","April","August",..: 6 6 6 6 6 6 6 6 5 5 ...
 $ year    : Factor w/ 22 levels "1996","1997",..: 10 11 15 16 17 18 19 20 10 11 ...
 $ quantity: int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ priceMin: Factor w/ 2027 levels "100","1000","1001",..: 1326 1464 278 1123 1511 319 28 300 1079 1205 ...
 $ priceMax: Factor w/ 2733 levels "1000","1001",..: 2135 2366 573 1760 2418 867 463 823 1723 1879 ...
 $ priceMod: Factor w/ 2425 levels "100","1000","1001",..: 1811 1987 471 1513 2031 614 258 622 1452 1625 ...


In [None]:
df.iloc[:,4:7].head()

In [None]:
df.iloc[:,2:7] = df.iloc[:,2:7].astype(int)

In [12]:
str(df)

'data.frame':	10227 obs. of  7 variables:
 $ market  : Factor w/ 121 levels "","ABOHAR(PB)",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ month   : Factor w/ 13 levels "","April","August",..: 6 6 6 6 6 6 6 6 5 5 ...
 $ year    : Factor w/ 22 levels "1996","1997",..: 10 11 15 16 17 18 19 20 10 11 ...
 $ quantity: int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ priceMin: Factor w/ 2027 levels "100","1000","1001",..: 1326 1464 278 1123 1511 319 28 300 1079 1205 ...
 $ priceMax: Factor w/ 2733 levels "1000","1001",..: 2135 2366 573 1760 2418 867 463 823 1723 1879 ...
 $ priceMod: Factor w/ 2425 levels "100","1000","1001",..: 1811 1987 471 1513 2031 614 258 622 1452 1625 ...


In [13]:
head(df)

Unnamed: 0,market,month,year,quantity,priceMin,priceMax,priceMod
1,ABOHAR(PB),January,2005,2350,404,493,446
2,ABOHAR(PB),January,2006,900,487,638,563
3,ABOHAR(PB),January,2010,790,1283,1592,1460
4,ABOHAR(PB),January,2011,245,3067,3750,3433
5,ABOHAR(PB),January,2012,1035,523,686,605
6,ABOHAR(PB),January,2013,675,1327,1900,1605


In [None]:
df.describe()

## Extracting the states from market names

In [None]:
df.market.value_counts().head()

In [None]:
df['state'] = df.market.str.split('(').str[-1]

In [None]:
df.head()

In [None]:
df['city'] = df.market.str.split('(').str[0]

In [None]:
df.head()

In [None]:
df.state.unique()

In [None]:
df['state'] = df.state.str.split(')').str[0]

In [None]:
df.state.unique()

In [None]:
dfState = df.groupby(['state', 'market'], as_index=False).count()

In [None]:
dfState.market.unique()

In [None]:
state_now = ['PB', 'UP', 'GUJ', 'MS', 'RAJ', 'BANGALORE', 'KNT', 'BHOPAL', 'OR',
       'BHR', 'WB', 'CHANDIGARH', 'CHENNAI', 'bellary', 'podisu', 'UTT',
       'DELHI', 'MP', 'TN', 'Podis', 'GUWAHATI', 'HYDERABAD', 'JAIPUR',
       'WHITE', 'JAMMU', 'HR', 'KOLKATA', 'AP', 'LUCKNOW', 'MUMBAI',
       'NAGPUR', 'KER', 'PATNA', 'CHGARH', 'JH', 'SHIMLA', 'SRINAGAR',
       'TRIVENDRUM']

In [None]:
state_new =['PB', 'UP', 'GUJ', 'MS', 'RAJ', 'KNT', 'KNT', 'MP', 'OR',
       'BHR', 'WB', 'CH', 'TN', 'KNT', 'TN', 'UP',
       'DEL', 'MP', 'TN', 'TN', 'ASM', 'AP', 'RAJ',
       'MS', 'JK', 'HR', 'WB', 'AP', 'UP', 'MS',
       'MS', 'KER', 'BHR', 'HR', 'JH', 'HP', 'JK',
       'KEL']

In [None]:
df.state = df.state.replace(state_now, state_new)

In [None]:
df.state.unique()

## Getting the Dates

In [None]:
df.index

In [None]:
pd.to_datetime('January 2012')

In [None]:
df['date'] = df['month'] + '-' + df['year'].map(str)

In [None]:
df.head()

In [None]:
index = pd.to_datetime(df.date)

In [None]:
df.index = pd.PeriodIndex(df.date, freq='M')

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.head()

In [None]:
df.to_csv('MonthWiseMarketArrivals_Clean.csv', index = False)