# Data wrangling: Assessing data
Notes on assessing the quality of data. Normally this is done in conjunction with cleaning, but these 
notes focus only on the assessment of data.

In [1]:
import pandas as pd

### Patient data set
The data sets used here are fake data from a clinical trial of an insulin alternative, Auralin. They include information on patients who tried the new drug as well as a control sample. The patients who took part in this trial were all over the age of 18. 

First we will import the three datasets and have a quick look at them to see if there's any data that looks like it might be incorrect.

In [2]:
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')

### Assess
These are the programmatic assessment methods in pandas that will probably be used most often:

* .head (DataFrame and Series) - show the first few entries in the data set
* .tail (DataFrame and Series) - show the last few entries
* .sample (DataFrame and Series) - get a random sample of the data set
* .info (DataFrame only) - Shows the number of non-null entries in each column and the data type (i.e., object, int, float etc.) for each column
* .describe (DataFrame and Series) - Gives summary statistics for each numeric column (count, mean, standard deviation, min, quartiles, max)
* .value_counts (Series only) - Gives a count for each unique value within a column
* Various methods of indexing and selecting data (.loc and bracket notation with/without boolean indexing, also .iloc)
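
As a quick illustration of the methods listed above, here is a minimal sketch on a small made-up DataFrame. The name `df` and its values are assumptions for the example only; they are not the real patients table.

```python
import pandas as pd

# Toy data standing in for the real tables (values are made up).
df = pd.DataFrame({
    "surname": ["Doe", "Doe", "Hill", "Phan", "Doe"],
    "weight": [180.0, 180.0, 118.8, 220.9, 180.0],
})

first_rows = df.head(2)                         # first 2 rows
last_rows = df.tail(2)                          # last 2 rows
random_rows = df.sample(2, random_state=0)      # 2 random rows, reproducible
df.info()                                       # prints non-null counts and dtypes
summary = df.describe()                         # count, mean, std, min, quartiles, max
counts = df["surname"].value_counts()           # occurrences of each unique value
heavy = df.loc[df["weight"] > 200, "surname"]   # boolean indexing with .loc
```

Here `counts` reports how often each surname occurs and `heavy` holds only the surnames of rows whose weight exceeds 200.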

In [3]:
patients.head()

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1


In [4]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [5]:
adverse_reactions.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


We want to see if there is any duplicate data. First we'll take a look at the value counts for patients' surnames and addresses.

In [6]:
patients.surname.value_counts()

Doe            6
Jakobsen       3
Taylor         3
Lâm            2
Woźniak        2
Tucker         2
Kadyrov        2
Correia        2
Nilsen         2
Silva          2
Cabrera        2
Batukayev      2
Souza          2
Aranda         2
Liễu           2
Schiavone      2
Grímsdóttir    2
Tạ             2
Parker         2
Hueber         2
Lund           2
Gersten        2
Johnson        2
Bùi            2
Ogochukwu      2
Lương          2
Berg           2
Cindrić        2
Kowalczyk      2
Collins        2
              ..
Tikhonov       1
Eldarkhanov    1
Martinsen      1
Grant          1
Heilmann       1
Rap            1
Citizen        1
Uspenskaya     1
Sandgreen      1
Dreher         1
Mancini        1
Wolfe          1
Montagu        1
Beauvais       1
Hsu            1
Bjarkason      1
Tromp          1
Hunter         1
Ibragimov      1
Webb           1
Knudsen        1
Uspensky       1
Bouw           1
Tabata         1
Schneider      1
Lansell        1
Lynge          1
Ekechukwu     

There are a lot of patients with the surname Doe, but perhaps this is just a coincidence. We can take a look at addresses to see if there are any repeated values.

In [7]:
patients.address.value_counts()

123 Main Street                  6
2778 North Avenue                2
648 Old Dear Lane                2
2476 Fulton Street               2
3464 Big Indian                  1
2246 Pheasant Ridge Road         1
4943 Isaacs Creek Road           1
4977 Arlington Avenue            1
3538 Paul Wayne Haggerty Road    1
4148 Callison Lane               1
4932 Goldleaf Lane               1
2915 Lynn Avenue                 1
142 Broad Street                 1
3977 Jail Drive                  1
4682 Science Center Drive        1
1079 Ingram Street               1
1333 Comfort Court               1
456 Delaware Avenue              1
3390 Hidden Meadow Drive         1
995 Beechwood Avenue             1
2935 Diamond Cove                1
108 Griffin Street               1
4145 Fairfax Drive               1
1233 Liberty Avenue              1
3414 Franklin Avenue             1
4160 Pratt Avenue                1
1965 Crestview Manor             1
883 Oakwood Circle               1
1168 Stout Street   

There are also 6 patients with an address of 123 Main Street. This is likely to be a filler address used to replace missing information.

In [8]:
patients[patients.address.duplicated()]

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
29,30,male,Jake,Jakobsen,648 Old Dear Lane,Port Jervis,New York,12771.0,United States,JakobCJakobsen@einrot.com+1 (845) 858-7707,8/1/1985,155.8,67,24.4
219,220,male,Mỹ,Quynh,,,,,,,4/9/1978,237.8,69,35.1
229,230,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
230,231,female,Elisabeth,Knudsen,,,,,,,9/23/1976,165.9,63,29.4
234,235,female,Martina,Tománková,,,,,,,4/7/1936,199.5,65,33.2
237,238,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
242,243,male,John,O'Brian,,,,,,,2/25/1957,205.3,74,26.4
244,245,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4
249,250,male,Benjamin,Mehler,,,,,,,10/30/1951,146.5,69,21.6
251,252,male,John,Doe,123 Main Street,New York,NY,12345.0,United States,johndoe@email.com1234567890,1/1/1975,180.0,72,24.4


We can see that **the patient John Doe appears several times** with the same information apart from the patient ID. This patient may have had multiple appointments and been recorded as a new patient each time. 
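
The output of `In [8]` also lists rows with no address at all, because `.duplicated()` treats missing values as duplicates of one another. The sketch below shows one possible way to separate the two cases. The toy frame `toy` and its values are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the patients table (values are made up).
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "given_name": ["John", "Zoe", "John", "Mỹ", "Elisabeth"],
    "surname": ["Doe", "Wellish", "Doe", "Quynh", "Knudsen"],
    "address": ["123 Main Street", "576 Brown Bear Drive",
                "123 Main Street", np.nan, np.nan],
})

# Naive check: the second missing address is flagged as a duplicate too.
naive = toy[toy["address"].duplicated()]

# Ignore missing addresses, and use keep=False to show every member
# of a duplicate group rather than only the later repeats.
repeated_address = toy[toy["address"].notna()
                       & toy["address"].duplicated(keep=False)]

# Compare whole records while ignoring the identifier column.
cols = [c for c in toy.columns if c != "patient_id"]
repeated_records = toy[toy.duplicated(subset=cols, keep=False)]
```

Comparing on every column except `patient_id` is a design choice that matches the observation above: the repeated John Doe rows differ only in their ID.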

Next we'll take a look at the weights to make sure that they seem to be within a reasonable range.

In [9]:
patients.weight.sort_values()

210     48.8
459    102.1
335    102.7
74     103.2
317    106.0
171    106.5
51     107.1
270    108.1
198    108.5
48     109.1
478    109.6
141    110.2
38     111.8
438    112.0
14     112.0
235    112.2
307    112.4
191    112.6
408    113.1
49     113.3
326    114.0
338    114.1
253    117.0
321    118.4
168    118.8
1      118.8
350    119.0
207    119.2
265    120.0
341    120.3
       ...  
332    224.0
252    224.2
12     224.2
222    224.8
166    225.3
111    225.9
101    226.2
150    226.6
352    227.7
428    227.7
88     227.7
13     228.4
339    229.0
182    230.3
121    230.8
257    231.7
395    231.9
246    232.1
219    237.8
11     238.7
50     238.9
441    239.1
499    239.6
439    242.0
487    242.4
144    244.9
61     244.9
283    245.5
118    254.5
485    255.9
Name: weight, Length: 503, dtype: float64

The weight for this data set is recorded in pounds. **One of the patients has a weight of only 48.8 lb.** This is highly unlikely, as the data set only includes people who are age 18 or above. We should note that this is likely to be an error so that we can revisit it during the cleaning phase.
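
Instead of scanning the sorted output by eye, a suspicious weight can also be flagged with boolean indexing. This is only a sketch: the toy frame and the 80 lb cut-off (`MIN_PLAUSIBLE_LB`) are assumptions chosen for illustration, not values taken from the trial.

```python
import pandas as pd

# Toy stand-in for the patients table (IDs are made up).
toy = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "weight": [121.7, 48.8, 255.9],
})

# Assumed lower bound for a plausible adult weight in pounds.
MIN_PLAUSIBLE_LB = 80

# Keep only the rows that fall below the assumed bound.
suspect = toy[toy["weight"] < MIN_PLAUSIBLE_LB]
```

The rows in `suspect` are candidates to note down for the cleaning phase, not confirmed errors.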

Next we can check to see if there are any missing entries for either of the drugs used in this trial:

In [10]:
treatments.auralin.isnull().sum()

0

In [11]:
treatments.novodra.isnull().sum()

0
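
A null count of zero does not necessarily mean that every entry holds a dose. In the `treatments.head()` output above, the drug a patient did not receive is recorded as `-`, which pandas reads as an ordinary string rather than a missing value. A minimal sketch of counting that placeholder, using the five rows shown in `treatments.head()` as a toy frame (the name `toy` is an assumption):

```python
import pandas as pd

# The auralin/novodra values from the treatments.head() output above.
toy = pd.DataFrame({
    "auralin": ["41u - 48u", "-", "-", "33u - 36u", "-"],
    "novodra": ["-", "40u - 45u", "39u - 36u", "-", "33u - 29u"],
})

drug_cols = ["auralin", "novodra"]

# isnull() finds nothing, because "-" is a regular string.
null_counts = toy[drug_cols].isnull().sum()

# Counting exact matches of the placeholder reveals the empty entries.
dash_counts = (toy[drug_cols] == "-").sum()
```

On these five rows `null_counts` is zero for both columns, while `dash_counts` shows that each column contains placeholder entries.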