# Challenge: Data cleaning & validation

Data cleaning is definitely a "practice makes perfect" skill. Using this dataset of open-access article charges paid by the Wellcome Trust between 2012 and 2013, determine the five most common journals and the total number of articles for each. Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal. You will need to do considerable data cleaning to extract accurate estimates, and may want to look into data encoding methods if you get stuck.

https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16

For a real bonus round, identify the open access prices paid by subject area.

As noted in the previous assignment, don't modify the data directly. Instead, write a cleaning script that will load the raw data and whip it into shape. Jupyter notebooks are a great format for this. Keep a record of your decisions: well-commented code is a must for recording your data cleaning decision-making progress. Submit a link to your script and results below and discuss it with your mentor at your next session.

In [57]:
import re
import pandas as pd
import numpy as np

df = pd.read_csv('~/thinkful_mac/thinkful_large_files/WELLCOME_APCspend2013_forThinkful.csv', encoding = 'ISO-8859-1')
#Needed to use a different encoding parameter than default in order to get the '£' symbol to load correctly
#The CSV would not load using the default encoder (utf-8)

df.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88
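
The encoding comment in the cell above can be illustrated on a single value. This is a minimal sketch, assuming the file stores the pound sign as the single byte 0xA3 (as ISO-8859-1 does); the sample string is taken from the output above and no file is read.

```python
# The pound sign is one byte (0xA3) in ISO-8859-1 but two bytes (0xC2 0xA3) in UTF-8.
raw = '£2381.04'.encode('iso-8859-1')   # bytes as they might sit in the CSV file

# A lone 0xA3 is not a valid way to start a UTF-8 character, so decoding fails
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print('utf-8 failed:', err.reason)

# Decoding with the encoding that was used to write the bytes recovers the text
decoded = raw.decode('iso-8859-1')
print(decoded)
```

The same mismatch is a plausible reason `pd.read_csv` fails with its default `utf-8` setting and succeeds with `encoding='ISO-8859-1'`.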


In [58]:
#Explore the data with a groupby to help visualize some of the problems
df2 = df.groupby('Journal title').count()
df2.head(1000)

#Will need to clear spaces before and after, remove commas and colons, capitalize all titles

Unnamed: 0_level_0,PMID/PMCID,Publisher,Article title,COST (£) charged to Wellcome (inc VAT when charged)
Journal title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ACS Chemical Biology,4,5,5,5
ACS Chemical Neuroscience,1,1,1,1
ACS NANO,1,1,1,1
ACS Nano,1,1,1,1
ACTA F,1,1,1,1
AGE,1,1,1,1
AIDS,3,3,3,3
AIDS Behav,1,1,1,1
AIDS Care,2,2,2,2
AIDS Journal,1,1,1,1
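
The groupby output shows the same journal under several spellings (for example `ACS NANO` and `ACS Nano`). A rough sketch of how normalization collapses such variants, using pandas' vectorized `.str` methods on a small hypothetical sample rather than the real column:

```python
import pandas as pd

# Hypothetical sample, modelled on the variants visible in the groupby output
titles = pd.Series(['ACS Nano', 'ACS NANO', ' AIDS Care', 'AIDS Care ', 'AIDS'])

# Trim outer whitespace, collapse internal runs of spaces, then upper-case
normalized = (titles.str.strip()
                    .str.replace(r'\s+', ' ', regex=True)
                    .str.upper())

# Compare the number of distinct titles before and after
print(titles.nunique(), '->', normalized.nunique())
print(normalized.value_counts())
```

Comparing `nunique()` before and after each cleaning step is a cheap way to check that a step actually merged something.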


In [59]:
#Here is our script to clean the data

def clean_dataframe(df):
    
    #Remove blank spaces before and after Journal title
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).strip())
    #Remove commas and colons
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).replace(':', ''))
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).replace(',', ''))
    #Put everything in all caps
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).upper())
    #Remove all text after " Section" to better group journals together
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).split(' SECTION', 1)[0])
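    #Replace the standalone abbreviation "J" with "JOURNAL" (word boundaries stop a J at the end of a longer word from matching)
    df['Journal title'] = df['Journal title'].apply(lambda x: re.sub(r'\bJ\b', 'JOURNAL', str(x)))
    #Drop " OF " so that "JOURNAL OF X" and "JOURNAL X" group together
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).replace(' OF ', ' '))
    #Expand the abbreviation " ORG " to " ORGANIC "
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).replace(' ORG ', ' ORGANIC '))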
    df['Journal title'] = df['Journal title'].apply(lambda x: str(x).strip())
    
    #Remove commas, $, £ signs from cost
    df['Cost New'] = df['COST (£) charged to Wellcome (inc VAT when charged)'].apply(lambda x: str(x).replace(',', ''))
    df['Cost New'] = df['Cost New'].apply(lambda x: str(x).replace('$', ''))
    df['Cost New'] = df['Cost New'].apply(lambda x: str(x).replace('£', ''))
    #Convert cost variable from string to a float to allow for mathematical operations
    df['Cost New'] = df['Cost New'].apply(lambda x: float(x))
    #Some cost amounts are the placeholder 999999.00 - replace these with NaN so they can be dropped
    df['Cost New'] = df['Cost New'].replace(999999.00, np.nan)
    #Note: dropna() with no subset drops a row if ANY column is NaN, so rows with a missing PMID/PMCID are removed too
    df = df.dropna()

    #Whether 0 would be a better replacement than NaN is up for debate; NaN keeps the placeholder out of the mean and median

    #Some cost amounts are 0 - leave these in here as they could be valid

    #Could do more cleaning if this were a "real" work project, but this is sufficient for now
    
    return df
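
The cost-cleaning steps in the function above can also be written more compactly. The following is a sketch under stated assumptions, not the notebook's method: the rows and the shortened column names (`PMID`, `Cost`) are hypothetical, and it drops rows only when the cost itself is missing, which differs from the bare `dropna()` used in the function.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names are shortened and 'PMC0000000' is a made-up ID
sample = pd.DataFrame({
    'PMID': [None, 'PMC3679557', 'PMC0000000'],
    'Cost': ['£0.00', '£2,381.04', '£999999.00'],
})

# Strip currency symbols and thousands separators in one pass, then convert to float
cost = sample['Cost'].str.replace(r'[£$,]', '', regex=True)
sample['Cost New'] = pd.to_numeric(cost, errors='coerce')

# Treat the 999999 placeholder as missing
sample['Cost New'] = sample['Cost New'].replace(999999.00, np.nan)

# Drop only rows whose cost is missing; a missing PMID does not remove a row
cleaned = sample.dropna(subset=['Cost New'])
print(cleaned)
```

Whether rows without a PMID/PMCID should count toward a journal's total is a judgment call; `subset=` just makes that decision explicit instead of a side effect.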

In [60]:
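#Create a copy of the original dataframe so that the original is not modified in the cleaning script
#(plain assignment would only create a second name for the same object, so .copy() is needed)
df_copy = df.copy()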
df3 = clean_dataframe(df_copy)
df3.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged),Cost New
1,PMC3679557,ACS,BIOMACROMOLECULES,Structural characterization of a Model Gram-ne...,£2381.04,2381.04
2,23043264 PMC3506128,ACS,JOURNAL MED CHEM,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56,642.56
3,23438330 PMC3646402,ACS,JOURNAL MED CHEM,Orvinols with mixed kappa/mu opioid receptor a...,£669.64,669.64
4,23438216 PMC3601604,ACS,JOURNAL ORGANIC CHEM,Regioselective opening of myo-inositol orthoes...,£685.88,685.88
5,PMC3579457,ACS,JOURNAL MEDICINAL CHEMISTRY,Comparative Structural and Functional Studies ...,£2392.20,2392.2
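
Since the cleaning function assigns to columns of the frame it receives, it matters whether that frame is a real copy. A small illustration with a hypothetical one-row frame, showing the difference between a second name for the same object and `DataFrame.copy()`:

```python
import pandas as pd

original = pd.DataFrame({'Journal title': ['J Med Chem']})

alias = original                 # a second name for the same DataFrame object
independent = original.copy()    # a new DataFrame with its own data

# Column assignment through the alias changes the original as well
alias['Journal title'] = alias['Journal title'].str.upper()

print(alias is original)
print(original['Journal title'].iloc[0])
print(independent['Journal title'].iloc[0])
```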


In [72]:
df_count = df3.groupby('Journal title', as_index = False).count()
df_count.head()

Unnamed: 0,Journal title,PMID/PMCID,Publisher,Article title,COST (£) charged to Wellcome (inc VAT when charged),Cost New
0,ACADEMY NUTRITION AND DIETETICS,1,1,1,1,1
1,ACS CHEMICAL BIOLOGY,4,4,4,4,4
2,ACS CHEMICAL NEUROSCIENCE,1,1,1,1,1
3,ACS NANO,2,2,2,2,2
4,ACTA CRYSTALLOGRAPHICA,5,5,5,5,5


In [73]:
#Calculate mean, median, and standard deviation

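#Select the numeric cost column first; recent pandas versions raise an error when asked for the mean of text columns
df_mean = df3.groupby('Journal title', as_index = False)['Cost New'].mean()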
df_mean.head()

Unnamed: 0,Journal title,Cost New
0,ACADEMY NUTRITION AND DIETETICS,2379.54
1,ACS CHEMICAL BIOLOGY,1535.965
2,ACS CHEMICAL NEUROSCIENCE,1186.8
3,ACS NANO,668.14
4,ACTA CRYSTALLOGRAPHICA,779.122


In [74]:
#Select the numeric cost column first, as with the mean
df_med = df3.groupby('Journal title', as_index = False)['Cost New'].median()
df_med.head()

Unnamed: 0,Journal title,Cost New
0,ACADEMY NUTRITION AND DIETETICS,2379.54
1,ACS CHEMICAL BIOLOGY,1294.685
2,ACS CHEMICAL NEUROSCIENCE,1186.8
3,ACS NANO,668.14
4,ACTA CRYSTALLOGRAPHICA,773.74


In [75]:
#Population standard deviation (ddof=0) of the cost column, indexed by journal title
df_std = df3.groupby('Journal title')[['Cost New']].std(ddof=0)
df_std.head()

### Need to JOIN std dev cost column over to df_final (see below)
merged = pd.merge(df_std, df_count, left_index = True, right_on='Journal title', how = "inner")
merged.head()

Unnamed: 0,Cost New_x,Journal title,PMID/PMCID,Publisher,Article title,COST (£) charged to Wellcome (inc VAT when charged),Cost New_y
0,0.0,ACADEMY NUTRITION AND DIETETICS,1,1,1,1,1
1,433.593733,ACS CHEMICAL BIOLOGY,4,4,4,4,4
2,0.0,ACS CHEMICAL NEUROSCIENCE,1,1,1,1,1
3,25.25,ACS NANO,2,2,2,2,2
4,16.891956,ACTA CRYSTALLOGRAPHICA,5,5,5,5,5


In [76]:
df_final = pd.DataFrame()
df_final['Journal'] = df_count['Journal title']
df_final['Count'] = df_count['Cost New']
df_final['Mean'] = df_mean['Cost New']
df_final['Med'] = df_med['Cost New']
#df_final['StDev'] = df_std['Cost'] -- ********Throws error. Why can't I use as_index = False on STDEV?*******

### Need to MERGE std dev cost column over to df_final (see above)
df_final['StDev'] = merged['Cost New_x']

df_final = df_final.sort_values('Count', ascending = False)
df_final[ : 5]

Unnamed: 0,Journal,Count,Mean,Med,StDev
667,PLOS ONE,181,1994.896022,897.19,14211.649511
436,JOURNAL BIOLOGICAL CHEMISTRY,52,1417.393269,1301.14,412.814709
610,NEUROIMAGE,28,2230.718571,2335.04,253.16356
628,NUCLEIC ACIDS RESEARCH,25,1160.88,852.0,438.778698
661,PLOS GENETICS,22,1643.110909,1712.73,149.84068
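
The PLOS ONE row has a standard deviation of about 14,212 against a median of 897.19, which suggests that at least one extreme value is still in the data. One way to surface such rows is to compare each cost with its journal's median. This is only a sketch: the costs below are hypothetical, and the 20x threshold is an arbitrary rule of thumb, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical costs; the 200000.00 entry stands in for an extreme value
costs = pd.DataFrame({
    'Journal title': ['PLOS ONE'] * 5 + ['NEUROIMAGE'] * 3,
    'Cost New': [897.19, 850.00, 1001.03, 920.50, 200000.00,
                 2335.04, 2200.00, 2400.00],
})

# Median cost per journal, broadcast back onto every row
journal_median = costs.groupby('Journal title')['Cost New'].transform('median')

# Hypothetical rule of thumb: flag anything above 20x its journal's median
THRESHOLD = 20
suspects = costs[costs['Cost New'] > THRESHOLD * journal_median]
print(suspects)
```

Flagged rows are candidates for a manual look (a misplaced decimal point or a different currency, for instance) rather than automatic deletion.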


In [89]:
#ALTERNATE SOLUTION#

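import numpy as np
#Aggregate only the numeric cost column; the result is indexed by journal title
#Note: np.count_nonzero does not count rows with a cost of 0, unlike .count()
result = df3.groupby('Journal title')[['Cost New']].agg([np.count_nonzero,
                                                         np.mean,
                                                         np.median,
                                                         lambda x: np.std(x, ddof=0)])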

#result.head()

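#Try renaming the column
#result['Cost New']['<lambda>'].name = 'Std Dev'
#result.head()
# Did not work, and did not throw an error either: this sets .name on a temporary Series,
# not the column label stored in result. DataFrame.rename(columns=...) is the usual way to relabel a column.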

In [91]:
#final = pd.DataFrame()
#final['Journal'] = result['Journal']
x = result['Cost New'].sort_values(['count_nonzero'], ascending = False)

x[:5]


Unnamed: 0_level_0,count_nonzero,mean,median,<lambda>
Journal title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
PLOS ONE,181.0,1994.896022,897.19,14211.649511
JOURNAL BIOLOGICAL CHEMISTRY,52.0,1417.393269,1301.14,412.814709
NEUROIMAGE,28.0,2230.718571,2335.04,253.16356
NUCLEIC ACIDS RESEARCH,25.0,1160.88,852.0,438.778698
PLOS GENETICS,22.0,1643.110909,1712.73,149.84068
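
The `<lambda>` column label above can be avoided with pandas named aggregation, which sets the output column names inside the `agg` call (available since pandas 0.25). A minimal sketch on hypothetical data, not a replacement for either solution above:

```python
import pandas as pd

# Hypothetical costs for two journals
sample = pd.DataFrame({
    'Journal title': ['PLOS ONE', 'PLOS ONE', 'PLOS ONE', 'NEUROIMAGE', 'NEUROIMAGE'],
    'Cost New': [800.0, 900.0, 1000.0, 2000.0, 2400.0],
})

# keyword = output column name, value = aggregation to apply
summary = (sample.groupby('Journal title')['Cost New']
                 .agg(Count='count',
                      Mean='mean',
                      Med='median',
                      StDev=lambda s: s.std(ddof=0))   # population standard deviation
                 .reset_index()
                 .sort_values('Count', ascending=False))
print(summary.head(5))
```

This yields the count, mean, median, and standard deviation in one pass, so no merge is needed to line the columns up.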


In [None]:
# In summary, both methods give the same top five journals, with matching count, mean, median, and standard deviation