# This is the Data Cleaning & Validation Challenge!

### Below is the output of .head(), showing the columns of our dataset.

In [268]:
import numpy as np
import pandas as pd
import scipy as sc
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("wellcome.csv", encoding='cp1252')
df.head()

Unnamed: 0,PMID/PM,Publisher,Journal tit,Article titl,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psycholog,Reduced,£0.00
1,PMC36795,ACS,Biomacro,Structural,£2381.04
2,23043264,ACS,J Med Che,Fumaroyl,£642.56
3,23438330,ACS,J Med Che,Orvinols,£669.64
4,23438216,ACS,J Org Che,Regiosele,£685.88


## First, I want to drop all rows that contain NaN, as they give no information.

In [270]:
new_df = df.dropna()
new_df.head()

Unnamed: 0,PMID/PM,Publisher,Journal tit,Article titl,COST (£) charged to Wellcome (inc VAT when charged)
1,PMC36795,ACS,Biomacro,Structural,£2381.04
2,23043264,ACS,J Med Che,Fumaroyl,£642.56
3,23438330,ACS,J Med Che,Orvinols,£669.64
4,23438216,ACS,J Org Che,Regiosele,£685.88
5,PMC35794,ACS,Journal of,Comparat,£2392.20
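
Dropping every row that has a NaN in any column can also discard rows that are missing only an identifier. As a minimal sketch on made-up data (the three toy rows and the choice of `subset` columns are assumptions, not taken from `wellcome.csv`), it can help to count the missing values per column first and then restrict `dropna` to the columns the analysis needs:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "PMID": [np.nan, "PMC1", "2"],
    "Publisher": ["CUP", "ACS", None],
    "cost": ["£0.00", "£10.00", "£5.00"],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with a NaN in any column (what plain dropna() does).
strict = toy.dropna()

# Drop rows only when a column needed for the analysis is missing.
# .copy() makes the result an independent frame, which avoids
# SettingWithCopyWarning on later column assignments.
lenient = toy.dropna(subset=["Publisher", "cost"]).copy()

print(len(strict), len(lenient))
```

Whether the lenient variant is appropriate depends on whether rows without a PMID should count towards the journal statistics.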


## I want to rename the column PMID/PM to PMID and COST (£) charged to Wellcome (inc VAT when charged) to cost, so the columns are easier to reference.

In [272]:
# Assign the result instead of using inplace=True: rename() then returns a new
# DataFrame, which avoids the SettingWithCopyWarning.
new_df = new_df.rename(columns={'PMID/PM': 'PMID'})


In [273]:
# Same approach as for PMID: assign the renamed frame rather than renaming in place.
new_df = new_df.rename(columns={'COST (£) charged to Wellcome (inc VAT when charged)': 'cost'})


## Printing .head() to preview the changes.

In [275]:
new_df.head()

Unnamed: 0,PMID,Publisher,Journal tit,Article titl,cost
1,PMC36795,ACS,Biomacro,Structural,£2381.04
2,23043264,ACS,J Med Che,Fumaroyl,£642.56
3,23438330,ACS,J Med Che,Orvinols,£669.64
4,23438216,ACS,J Org Che,Regiosele,£685.88
5,PMC35794,ACS,Journal of,Comparat,£2392.20
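
Both renames can also be done in a single call. The sketch below uses a two-row toy frame (made-up values) and assigns the result rather than using `inplace=True`, so the original frame is left untouched:

```python
import pandas as pd

# Toy frame with the two original column names (values are illustrative only).
toy = pd.DataFrame({
    "PMID/PM": ["PMC1", "2"],
    "COST (£) charged to Wellcome (inc VAT when charged)": ["£1.00", "£2.00"],
})

# One mapping handles both columns; rename() returns a new DataFrame.
renamed = toy.rename(columns={
    "PMID/PM": "PMID",
    "COST (£) charged to Wellcome (inc VAT when charged)": "cost",
})

print(list(renamed.columns))
```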


## I only want the rows whose PMID is PMC followed by digits.

In [277]:
# Match the literal prefix "PMC" followed by one or more digits.
# No capture group is used, so str.contains() does not warn about match groups.
new_df[new_df['PMID'].str.contains(r'PMC[0-9]+')].head()


Unnamed: 0,PMID,Publisher,Journal tit,Article titl,cost
1,PMC36795,ACS,Biomacro,Structural,£2381.04
5,PMC35794,ACS,Journal of,Comparat,£2392.20
6,PMC37092,ACS,Journal of,Mapping,£2367.95
14,PMC3413,ACS Publi,Biochemi,Monomer,£665.64
15,PMC36943,ACS Publi,Journal of,Synthesis,£1006.72
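
How strict the pattern is changes which rows are selected. A pattern such as `[A-Z][0-9]` accepts any capital letter followed by a digit, while `PMC[0-9]+` requires the `PMC` prefix, and `str.fullmatch` additionally rejects anything before or after the ID. The sketch below compares the three on made-up identifiers; `str.fullmatch` needs pandas 1.1 or later.

```python
import pandas as pd

# Made-up identifiers, chosen to show where the patterns disagree.
ids = pd.Series(["PMC123456", "23043264", "A1", "PMC12345 (epub)"])

# Any capital letter followed by a digit, anywhere in the string.
loose = ids.str.contains(r"[A-Z][0-9]")

# "PMC" followed by at least one digit, anywhere in the string.
anywhere = ids.str.contains(r"PMC[0-9]+")

# The whole string must be "PMC" followed only by digits.
exact = ids.str.fullmatch(r"PMC[0-9]+")

print(pd.DataFrame({"id": ids, "loose": loose, "anywhere": anywhere, "exact": exact}))
```

On real data, the exact variant may exclude IDs with stray whitespace or annotations, so stripping the strings first can be worthwhile.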


## Now I'm removing the $ and £ symbols from the cost column and converting it to numeric values, so that the mean, median, and standard deviation can be computed later without errors.
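
A compact sketch of this step on a toy Series (the three values are made up). Using `pd.to_numeric` with `errors="coerce"` is an assumption here, offered as an alternative to `astype(float)`; it turns unparseable entries into NaN instead of raising.

```python
import pandas as pd

# Made-up cost strings with both currency symbols.
raw = pd.Series(["£2381.04", "$642.56", "£0.00"])

# regex=False treats "$" as a literal character. Interpreted as a regular
# expression, "$" is an end-of-string anchor and would remove nothing.
stripped = (
    raw.str.replace("£", "", regex=False)
       .str.replace("$", "", regex=False)
)

# Convert to floats; anything that is still not a number becomes NaN.
numeric = pd.to_numeric(stripped, errors="coerce")

print(numeric)
```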

In [314]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning.
new_df = new_df.assign(cost=new_df['cost'].str.replace("£", "", regex=False))


In [316]:
# regex=False is required here: as a regular expression, "$" is an
# end-of-string anchor, so the dollar sign would not be removed.
new_df = new_df.assign(cost=new_df['cost'].str.replace("$", "", regex=False))


In [318]:
# Convert the cleaned strings to floats on a new DataFrame.
new_df = new_df.assign(cost=new_df['cost'].astype(float))


## Determine the five most common journals and the total articles for each.

In [281]:
new_df['Journal tit'].value_counts().head(6)

Journal of    277
PLoS One       91
PLoS ONE       62
Proceedin      39
Molecular      35
American       27
Name: Journal tit, dtype: int64
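
The counts above list "PLoS One" and "PLoS ONE" separately, although they presumably refer to the same journal. A minimal sketch of collapsing such spelling variants before counting, on made-up titles (the choice of upper-casing and whitespace stripping is an assumption; other variants may need further rules):

```python
import pandas as pd

# Made-up titles with case and whitespace variants.
titles = pd.Series(["PLoS One", "PLoS ONE", "PLOS ONE ", "Molecular"])

# Trim surrounding whitespace and fold case so the variants collapse.
normalized = titles.str.strip().str.upper()

counts = normalized.value_counts()
print(counts)
```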

## Calculating the sum, mean, and SD for PLoS One articles.

In [327]:
cost = new_df.groupby('Journal tit').agg({'cost':['sum', 'mean', 'std']}).reset_index()

In [328]:
cost.loc[cost['Journal tit'] == "PLoS One", 'cost']

Unnamed: 0,sum,mean,std
398,2274595.91,24995.559451,148336.487817
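
The aggregation above computes the sum, mean, and standard deviation. The median can be obtained the same way by adding `'median'` to the list passed to `agg`. A minimal sketch on a toy frame (the journal names `A` and `B` and their costs are made up):

```python
import pandas as pd

# Toy data: two hypothetical journals with a few article costs each.
toy = pd.DataFrame({
    "Journal tit": ["A", "A", "A", "B", "B"],
    "cost": [1.0, 2.0, 9.0, 4.0, 6.0],
})

# One row per journal, one column per statistic.
stats = toy.groupby("Journal tit")["cost"].agg(["mean", "median", "std"])

print(stats)
```

Because the median is less sensitive to extreme values than the mean, comparing the two columns is a quick check for outliers in the cost data.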


## Calculating the sum, mean, and SD for PLoS ONE articles.

In [329]:
cost.loc[cost['Journal tit'] == "PLoS ONE", 'cost']

Unnamed: 0,sum,mean,std
397,3053420.47,49248.717258,216138.48622


## Calculating the sum, mean, and SD for Proceedin.

In [330]:
cost.loc[cost['Journal tit'] == "Proceedin", 'cost']

Unnamed: 0,sum,mean,std
436,38752.33,993.649487,489.82359


## Calculating the sum, mean, and SD for Molecular.

In [331]:
cost.loc[cost['Journal tit'] == "Molecular", 'cost']

Unnamed: 0,sum,mean,std
330,2071276.9,59179.34,234996.239285


## Calculating the sum, mean, and SD for American.

In [332]:
cost.loc[cost['Journal tit'] == "American", 'cost']

Unnamed: 0,sum,mean,std
29,55408.59,2052.17,536.757761
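
The per-journal lookups above can also be done in one step. If the journal title is kept as the index (that is, without `reset_index()`), a list of titles can be passed to `.loc` to get all rows at once. A minimal sketch on a toy frame (journal names and costs are made up):

```python
import pandas as pd

# Toy data: three hypothetical journals.
toy = pd.DataFrame({
    "Journal tit": ["A", "A", "B", "C", "C"],
    "cost": [1.0, 3.0, 4.0, 10.0, 20.0],
})

# Journal titles become the index of the result.
stats = toy.groupby("Journal tit")["cost"].agg(["sum", "mean", "std"])

# Select the statistics for several journals in one lookup.
wanted = ["A", "C"]
subset = stats.loc[wanted]

print(subset)
```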
