# Assignment: Data Wrangling
## `! git clone https://github.com/DS3001/wrangling`
## Do Q2, and one of Q1 or Q3.

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."
  4. Read Section 2.2. How does Wickham define values, variables, and observations?
  5. How is "Tidy Data" defined in section 2.3?
  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?
  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?
  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?

1. Wickham explores the process of data cleaning and proposes a more abstract way of thinking about it. While removing missing data is a common cleaning task, Wickham asks a broader question: what should cleaned data ultimately look like? He introduces a set of criteria for tidy data: each row represents an observation, each column a variable, and each type of observational unit is organized into its own table. The paper then works through the implications of this framework for data analysis.

2. Data cleaning, despite being a time-consuming and conceptually challenging part of data analysis, is understudied. The "tidy data standard" aims to standardize this process, making it easier to clean data by providing a clear objective and a set of steps. The idea is that if everyone follows this standard, data cleaning becomes more straightforward and consistent.

3. The first sentence says that tidy datasets all share one uniform structure, whereas each messy dataset can be messy in its own way and so presents its own challenges. The second sentence says that, although it is usually intuitive to tell observations from variables in a particular dataset, it is hard to write down a general definition of the two that works for every dataset.

4. Wickham defines a dataset as a collection of values, which can be numeric or categorical. Each value belongs to both a variable and an observation. A variable is the set of all values measuring the same attribute across units, and an observation is the set of all values measured on the same unit across attributes.

5. Tidy data follows a specific structure: each variable is in a column, each observation is in a row, and each type of observational unit is organized into its own table. Data that does not follow this structure is considered messy.

6. The five most common problems are: (1) column headers are values, not variable names; (2) multiple variables are stored in a single column; (3) variables are stored in both rows and columns; (4) different types of observational units are mixed in the same table; and (5) a single observational unit is split across multiple tables. Table 4 is messy because its column headers are values of an implicit variable (income), which should be made explicit in the dataset. "Melting" refers to reshaping the data by turning columns into rows, so that the values held in the column headers become the values of an explicit variable.

7. Table 11 is messy because the days are represented as column headers, which are values rather than variable names. In Table 12, these values are melted into a single "date" variable, and the table is restructured. However, Table 12(a) is still not fully tidy, because the "element" column holds "tmax" and "tmin", which are names of variables rather than values. Table 12(b) resolves this by giving each of them its own column, which yields a fully tidy dataset.

8. Wickham identifies a "chicken-and-egg" problem with focusing on tidy data: if the tidy framework exists only to support specific tools, it may seem like a form of branding. However, Wickham hopes that the concept of tidy data will lead to the development of a broader philosophy for data cleaning, one that fosters a more comprehensive ecosystem of tools and approaches for data science as a whole.
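
The melting step described in answers 6 and 7 can be sketched in pandas. This is only an illustration, not Wickham's own code: the `weather` frame, the station id `S1`, and all of its numbers are made up, and only two day columns are shown.

```python
import pandas as pd

# Hypothetical miniature of a Table 11-style layout: days are column headers
weather = pd.DataFrame({
    "id": ["S1", "S1"],
    "element": ["tmax", "tmin"],
    "d1": [30.1, 14.2],
    "d2": [29.5, 13.8],
})

# Melt: the day columns become values of an explicit `day` variable
molten = weather.melt(id_vars=["id", "element"], var_name="day", value_name="temp")

# Tidy: the variable names stored in `element` become columns of their own
tidy = molten.pivot(index=["id", "day"], columns="element", values="temp").reset_index()
tidy.columns.name = None  # drop the leftover "element" label on the column index

print(molten)
print(tidy)
```

The molten frame corresponds to the intermediate form (one row per station, day, and element), and the pivot corresponds to the last step that turns it into one row per station and day.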

**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)
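
The hint in part 1 can be checked on a toy Series before touching the real file. The values below are invented for illustration; the point is that a thousands separator makes `pd.to_numeric` treat the entry as non-numeric unless the comma is removed first.

```python
import pandas as pd

# Hypothetical prices as they would arrive when read as strings
raw = pd.Series(["675", "1,112", "85"])

# Coercing directly turns the comma-formatted price into a missing value
naive = pd.to_numeric(raw, errors="coerce")

# Stripping the comma first lets every entry convert
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print("missing without stripping commas:", naive.isnull().sum())
print("missing after stripping commas:", cleaned.isnull().sum())
```

Comparing the two missing counts is a quick way to confirm that the cleaning step is not silently discarding the most expensive listings.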

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
# Q2.1: clean the Airbnb Price variable
df = pd.read_csv('./data/airbnb_hw.csv', low_memory=False)
print(df.shape, '\n')
df.head()
price = df['Price']
print(price.unique(), '\n')  # prices over 999 carry a thousands comma, e.g. '1,112'
price = price.str.replace(',', '', regex=False)  # strip the thousands separator
print(price.unique(), '\n')
price = pd.to_numeric(price, errors='coerce')  # anything still non-numeric becomes NaN
print(price.unique(), '\n')
print('Total missing: ', price.isnull().sum())
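# Q2.2: clean the shark attack Type variable
df = pd.read_csv('./data/sharks.csv', low_memory=False)
df.head()
print(df['Type'].value_counts(), '\n')
shark_type = df['Type']  # named shark_type to avoid shadowing the built-in type()
# Collapse the boat-related labels (including the typo 'Boatomg') into one category
shark_type = shark_type.replace(['Sea Disaster', 'Boat', 'Boating', 'Boatomg'], 'Watercraft')
# Treat labels that say nothing about the kind of incident as missing
shark_type = shark_type.replace(['Invalid', 'Questionable', 'Unconfirmed', 'Unverified', 'Under investigation'], np.nan)
df['Type'] = shark_type
del shark_type
print(df['Type'].value_counts(), '\n')
# Clean the fatality flag so it can be cross-tabulated against Type
df['Fatal (Y/N)'] = df['Fatal (Y/N)'].replace(['UNKNOWN', 'F', 'M', '2017'], np.nan)
df['Fatal (Y/N)'] = df['Fatal (Y/N)'].replace('y', 'Y')
print(pd.crosstab(df['Type'], df['Fatal (Y/N)'], normalize='index'))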
# Q2.3: clean the pretrial release dummy
# Assumes the pretrial CSV sits in ./data alongside the other files
df = pd.read_csv('./data/VirginiaPretrialData2017.csv', low_memory=False)
df.head()
release = df['WhetherDefendantWasReleasedPretrial']
print(release.unique(), '\n')
print(release.value_counts(), '\n')
release = release.replace(9, np.nan)  # 9 is neither 0 nor 1, so recode it as missing
print(release.value_counts(), '\n')
print('Total missing: ', release.isnull().sum())
df['WhetherDefendantWasReleasedPretrial'] = release
del release
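# Q2.4: clean the imposed sentence length, using sentence type to explain missing values
length = df['ImposedSentenceAllChargeInContactEvent']
sentence_type = df['SentenceTypeAllChargesAtConvictionInContactEvent']  # avoids shadowing type()

print(length.unique(), '\n')
length = pd.to_numeric(length, errors='coerce')  # non-numeric entries become NaN
length_NA = length.isnull()
print(length_NA.sum(), '\n')

# Check how missingness lines up with sentence type
print(pd.crosstab(length_NA, sentence_type), '\n')

length = length.mask(sentence_type == 4, 0)  # type 4: recode the sentence length as 0
length = length.mask(sentence_type == 9, np.nan)  # type 9: keep as missing

length_NA = length.isnull()
print(pd.crosstab(length_NA, sentence_type), '\n')
print(length_NA.sum(), '\n')  # 274 missing, much better

df['ImposedSentenceAllChargeInContactEvent'] = length
del length, sentence_type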
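
The logic of part 4 can be seen on a toy example. The frame below is invented, and the column names `sentence_type` and `length` are placeholders; it only mirrors what the code above does, which is to recode sentence type 4 to a length of 0 and leave type 9 missing. Whether those codes mean what that recoding implies should be checked against the data's codebook.

```python
import numpy as np
import pandas as pd

# Hypothetical records: blank lengths appear only for sentence types 4 and 9
toy = pd.DataFrame({
    "sentence_type": [0, 1, 4, 4, 9],
    "length": ["12", "36", " ", " ", " "],
})

length = pd.to_numeric(toy["length"], errors="coerce")  # blanks become NaN
missing_before = length.isnull().sum()

# Cross-tabulating missingness against type shows where the blanks come from
print(pd.crosstab(length.isnull(), toy["sentence_type"]))

length = length.mask(toy["sentence_type"] == 4, 0)       # assumed: no sentence imposed
length = length.mask(toy["sentence_type"] == 9, np.nan)  # assumed: genuinely unknown
missing_after = length.isnull().sum()

print(missing_before, missing_after)
```

The crosstab is the key diagnostic: when missing values cluster in specific categories of another variable, they are not missing at random, and some of them can be filled in from context instead of being dropped.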