# Assignment: Data Wrangling
## `! git clone https://github.com/DS3001/wrangling`
## Do Q2, and one of Q1 or Q3.

**Q1.** Open the "tidy_data.pdf" document in the repo, which is a paper called Tidy Data by Hadley Wickham.

  1. Read the abstract. What is this paper about?
  2. Read the introduction. What is the "tidy data standard" intended to accomplish?
  3. Read the intro to section 2. What does this sentence mean: "Like families, tidy datasets are all alike but every messy dataset is messy in its own way." What does this sentence mean: "For a given dataset, it’s usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general."
  4. Read Section 2.2. How does Wickham define values, variables, and observations?
  5. How is "Tidy Data" defined in section 2.3?
  6. Read the intro to Section 3 and Section 3.1. What are the 5 most common problems with messy datasets? Why are the data in Table 4 messy? What is "melting" a dataset?
  7. Why, specifically, is table 11 messy but table 12 tidy and "molten"?
  8. Read Section 6. What is the "chicken-and-egg" problem with focusing on tidy data? What does Wickham hope happens in the future with further work on the subject of data wrangling?
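As a rough illustration of the "melting" asked about in item 6, the sketch below uses pandas `melt` on a small toy table whose column headers are values rather than variable names. The column names and counts are placeholders chosen for illustration, and `melt` is only one way to do this reshaping.

```python
import pandas as pd

# Toy "messy" table: the income brackets are stored as column headers,
# so the headers are values of a variable, not variable names.
messy = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# Melt: keep `religion` as the identifier and stack the other columns
# into (income, freq) pairs, giving one row per observation.
tidy = messy.melt(id_vars="religion", var_name="income", value_name="freq")
print(tidy)
```

After melting, each row holds one religion and income combination with its count, so every column is a variable and every row is an observation.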

**Q2.** This question provides some practice cleaning variables which have common problems.
1. Numeric variable: For `./data/airbnb_hw.csv`, clean the `Price` variable as well as you can, and explain the choices you make. How many missing values do you end up with? (Hint: What happens to the formatting when a price goes over 999 dollars, say from 675 to 1,112?)
2. Categorical variable: For the `./data/sharks.csv` data covered in the lecture, clean the "Type" variable as well as you can, and explain the choices you make.
3. Dummy variable: For the pretrial data covered in the lecture, clean the `WhetherDefendantWasReleasedPretrial` variable as well as you can, and, in particular, replace missing values with `np.nan`.
4. Missing values, not at random: For the pretrial data covered in the lecture, clean the `ImposedSentenceAllChargeInContactEvent` variable as well as you can, and explain the choices you make. (Hint: Look at the `SentenceTypeAllChargesAtConvictionInContactEvent` variable.)

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv('/content/airbnb_hw.csv', low_memory=False)
print( df.shape, '\n')
df.head()

(30478, 13) 



| | Host Id | Host Since | Name | Neighbourhood | Property Type | Review Scores Rating (bin) | Room Type | Zipcode | Beds | Number of Records | Number Of Reviews | Price | Review Scores Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5162530 | | 1 Bedroom in Prime Williamsburg | Brooklyn | Apartment | | Entire home/apt | 11249.0 | 1.0 | 1 | 0 | 145 | |
| 1 | 33134899 | | Sunny, Private room in Bushwick | Brooklyn | Apartment | | Private room | 11206.0 | 1.0 | 1 | 1 | 37 | |
| 2 | 39608626 | | Sunny Room in Harlem | Manhattan | Apartment | | Private room | 10032.0 | 1.0 | 1 | 1 | 28 | |
| 3 | 500 | 6/26/2008 | Gorgeous 1 BR with Private Balcony | Manhattan | Apartment | | Entire home/apt | 10024.0 | 3.0 | 1 | 0 | 199 | |
| 4 | 500 | 6/26/2008 | Trendy Times Square Loft | Manhattan | Apartment | 95.0 | Private room | 10036.0 | 3.0 | 1 | 39 | 549 | 96.0 |


In [7]:
price = df['Price']
price.unique()

array(['145', '37', '28', '199', '549', '149', '250', '90', '270', '290',
       '170', '59', '49', '68', '285', '75', '100', '150', '700', '125',
       '175', '40', '89', '95', '99', '499', '120', '79', '110', '180',
       '143', '230', '350', '135', '85', '60', '70', '55', '44', '200',
       '165', '115', '74', '84', '129', '50', '185', '80', '190', '140',
       '45', '65', '225', '600', '109', '1,990', '73', '240', '72', '105',
       '155', '160', '42', '132', '117', '295', '280', '159', '107', '69',
       '239', '220', '399', '130', '375', '585', '275', '139', '260',
       '35', '133', '300', '289', '179', '98', '195', '29', '27', '39',
       '249', '192', '142', '169', '1,000', '131', '138', '113', '122',
       '329', '101', '475', '238', '272', '308', '126', '235', '315',
       '248', '128', '56', '207', '450', '215', '210', '385', '445',
       '136', '247', '118', '77', '76', '92', '198', '205', '299', '222',
       '245', '104', '153', '349', '114', '320', '292', '22

In [8]:
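price = df['Price']
# Prices above 999 are stored with a thousands separator (e.g. '1,990'); strip it
price = price.str.replace(',','')
print( price.unique() , '\n')
# Convert the strings to numbers; anything unparseable becomes NaN
price = pd.to_numeric(price,errors='coerce')
print( price.unique() , '\n')
print( 'Total missing: ', sum( price.isnull() ) )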

['145' '37' '28' '199' '549' '149' '250' '90' '270' '290' '170' '59' '49'
 '68' '285' '75' '100' '150' '700' '125' '175' '40' '89' '95' '99' '499'
 '120' '79' '110' '180' '143' '230' '350' '135' '85' '60' '70' '55' '44'
 '200' '165' '115' '74' '84' '129' '50' '185' '80' '190' '140' '45' '65'
 '225' '600' '109' '1990' '73' '240' '72' '105' '155' '160' '42' '132'
 '117' '295' '280' '159' '107' '69' '239' '220' '399' '130' '375' '585'
 '275' '139' '260' '35' '133' '300' '289' '179' '98' '195' '29' '27' '39'
 '249' '192' '142' '169' '1000' '131' '138' '113' '122' '329' '101' '475'
 '238' '272' '308' '126' '235' '315' '248' '128' '56' '207' '450' '215'
 '210' '385' '445' '136' '247' '118' '77' '76' '92' '198' '205' '299'
 '222' '245' '104' '153' '349' '114' '320' '292' '226' '420' '500' '325'
 '307' '78' '265' '108' '123' '189' '32' '58' '86' '219' '800' '335' '63'
 '229' '425' '67' '87' '1200' '158' '650' '234' '310' '695' '400' '166'
 '119' '62' '168' '340' '479' '43' '395' '144' '52' '47
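
Because the output above is cut off, the same cleaning steps can be checked on a small toy series. This is a minimal sketch: the values in `raw` are stand-ins rather than rows from the actual file, and `regex=False` is an optional addition that makes the comma a literal character.

```python
import pandas as pd

# Toy stand-in for the raw Price column: strings, some with a thousands separator,
# plus one genuinely missing entry.
raw = pd.Series(["145", "1,990", "1,000", "675", None])

# Strip the separator, then coerce to numbers; unparseable entries become NaN.
clean = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(clean.tolist())
print("Total missing:", clean.isna().sum())
```

Only the entry that was missing to begin with stays missing; the comma-formatted prices convert cleanly once the separator is removed.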

In [9]:
df = pd.read_csv('/content/sharks.csv', low_memory=False)
df['Type'].value_counts()

Unprovoked             4716
Provoked                593
Invalid                 552
Sea Disaster            239
Watercraft              142
Boat                    109
Boating                  92
Questionable             10
Unconfirmed               1
Unverified                1
Under investigation       1
Boatomg                   1
Name: Type, dtype: int64

In [10]:
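# Use a name other than `type` so the Python builtin is not shadowed
shark_type = df['Type']
# Collapse 'Sea Disaster' and the boat-related labels (including the typo 'Boatomg') into one category
shark_type = shark_type.replace(['Sea Disaster', 'Boat','Boating','Boatomg'],'Watercraft')

# Treat unclear or unverified reports as missing
shark_type = shark_type.replace(['Invalid', 'Questionable','Unconfirmed','Unverified','Under investigation'],np.nan)

df['Type'] = shark_type
del shark_type

df['Type'].value_counts()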

Unprovoked    4716
Provoked       593
Watercraft     583
Name: Type, dtype: int64
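
One possible alternative to the chained list replacements above is a single recoding dictionary, which keeps every cleaning decision in one place. The sketch below runs on a toy series using labels from the value counts; the name `recode` is illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Type column, using labels seen in the value counts.
shark_type = pd.Series(
    ["Unprovoked", "Boat", "Boatomg", "Invalid", "Provoked", "Sea Disaster"]
)

# Each original label maps to its cleaned value; NaN marks "treat as missing".
recode = {
    "Sea Disaster": "Watercraft",
    "Boat": "Watercraft",
    "Boating": "Watercraft",
    "Boatomg": "Watercraft",
    "Invalid": np.nan,
    "Questionable": np.nan,
    "Unconfirmed": np.nan,
    "Unverified": np.nan,
    "Under investigation": np.nan,
}

cleaned = shark_type.replace(recode)
print(cleaned.value_counts(dropna=False))
```

Labels that do not appear in the dictionary, such as `Unprovoked` and `Provoked`, pass through unchanged.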

In [11]:
df['Fatal (Y/N)'] = df['Fatal (Y/N)'].replace(['UNKNOWN', 'F','M','2017'],np.nan)
df['Fatal (Y/N)'] = df['Fatal (Y/N)'].replace('y','Y')
pd.crosstab(df['Type'],df['Fatal (Y/N)'],normalize='index')

Fatal (Y/N)         N         Y
Type                           
Provoked     0.967521  0.032479
Unprovoked   0.743871  0.256129
Watercraft   0.684303  0.315697


In [28]:
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_parquet('/content/justice_data.parquet')

In [29]:
release = df['WhetherDefendantWasReleasedPretrial']
print(release.unique(),'\n')
print(release.value_counts(),'\n')
# Recode 9 as missing so that only the 0/1 values remain
release = release.replace(9,np.nan)
print(release.value_counts(),'\n')
sum(release.isnull())
df['WhetherDefendantWasReleasedPretrial'] = release
del release

[9 0 1] 

1    19154
0     3801
9       31
Name: WhetherDefendantWasReleasedPretrial, dtype: int64 

1.0    19154
0.0     3801
Name: WhetherDefendantWasReleasedPretrial, dtype: int64 
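
The second set of counts shows `1.0` and `0.0` because replacing a value with `np.nan` converts the column to floating point. If integer 0/1 codes are preferred, pandas' nullable `Int64` dtype is one option. The sketch below uses a toy series and treats 9 as missing, as in the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the release indicator; 9 is treated as missing, as above.
release = pd.Series([9, 0, 1, 1, 0])

# Replacing the sentinel with NaN converts the column to float.
release = release.replace(9, np.nan)
print(release.dtype)

# Optional: the nullable integer dtype keeps 0/1 as integers and shows <NA>.
release_int = release.astype("Int64")
print(release_int.value_counts(dropna=False))
```

Whether to convert is a matter of taste: the float version works with every NumPy routine, while the `Int64` version reads more naturally as a dummy variable.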



In [30]:
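length = df['ImposedSentenceAllChargeInContactEvent']
# Use a name other than `type` so the Python builtin is not shadowed
sentence_type = df['SentenceTypeAllChargesAtConvictionInContactEvent']

# Convert to numeric; entries that cannot be parsed become NaN
length = pd.to_numeric(length,errors='coerce')
length_NA = length.isnull()
print( np.sum(length_NA),'\n')

# Check which sentence types account for the missing lengths
print( pd.crosstab(length_NA, sentence_type), '\n')

# Sentence type 4: set the length to 0; sentence type 9: keep as missing
length = length.mask( sentence_type == 4, 0)
length = length.mask( sentence_type == 9, np.nan)

length_NA = length.isnull()
print( pd.crosstab(length_NA, sentence_type), '\n')
print( np.sum(length_NA),'\n')

df['ImposedSentenceAllChargeInContactEvent'] = length
del length, sentence_type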

9053 

SentenceTypeAllChargesAtConvictionInContactEvent     0     1    2     4    9
ImposedSentenceAllChargeInContactEvent                                      
False                                             8720  4299  914     0    0
True                                                 0     0    0  8779  274 

SentenceTypeAllChargesAtConvictionInContactEvent     0     1    2     4    9
ImposedSentenceAllChargeInContactEvent                                      
False                                             8720  4299  914  8779    0
True                                                 0     0    0     0  274 

274 
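
The conditional recoding above can be sketched on a few toy rows to show how `mask` fills in values based on a second variable. The column names and values below are invented for illustration; the rule that code 4 gets a length of 0 and code 9 stays missing follows the cell above.

```python
import numpy as np
import pandas as pd

# Toy rows: lengths are stored as strings, with blanks where no length was recorded.
toy = pd.DataFrame({
    "sentence_type": [0, 1, 4, 4, 9],
    "sentence_length": ["12", "36.5", " ", " ", " "],
})

# Blanks cannot be parsed as numbers, so they become NaN.
length = pd.to_numeric(toy["sentence_length"], errors="coerce")

# Where the sentence type is 4, replace the length with 0;
# where it is 9, keep the length missing.
length = length.mask(toy["sentence_type"] == 4, 0)
length = length.mask(toy["sentence_type"] == 9, np.nan)

print(pd.crosstab(length.isna(), toy["sentence_type"]))
```

The crosstab is the key check: after recoding, missing lengths should remain only in the category where the value is truly unknown.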



**Q3.** Many important datasets contain a race variable, typically limited to a handful of values often including Black, White, Asian, Latino, and Indigenous. This question looks at data gathering efforts on this variable by the U.S. Federal government.

1. How did the most recent US Census gather data on race?
2. Why do we gather these data? What role do these kinds of data play in politics and society? Why does data quality matter?
3. Please provide a constructive criticism of how the Census was conducted: What was done well? What do you think was missing? How should future large scale surveys be adjusted to best reflect the diversity of the population? Could some of the Census' good practices be adopted more widely to gather richer and more useful data?
4. How did the Census gather data on sex and gender? Please provide a similar constructive criticism of their practices.
5. When it comes to cleaning data, what concerns do you have about protected characteristics like sex, gender, sexual identity, or race? What challenges can you imagine arising when there are missing values? What good or bad practices might people adopt, and why?
6. Suppose someone invented an algorithm to impute values for protected characteristics like race, gender, sex, or sexuality. What kinds of concerns would you have?

1. The most recent U.S. Census was conducted in 2020. Race data were collected through a questionnaire that included specific questions on race and ethnicity. Respondents were asked to select one or more races with which they identified. The race categories on the questionnaire included White, Black or African American, American Indian or Alaska Native, Asian, Native Hawaiian or Other Pacific Islander, and Some Other Race. Hispanic or Latino origin was asked as a separate question. (Source: web search.)

2. The collection of demographic data, including race and ethnicity, serves several important political, social, and policymaking purposes:

   1. **Representation and Voting Rights**
   2. **Policy Development**
   3. **Civil Rights**
   4. **Resource Allocation**
   5. **Research**

Data quality is critical to ensuring the reliability, accuracy and completeness of demographic data, as biases in the data can lead to poor policy decisions, misallocation of resources and unfair treatment of certain groups. It is therefore important that efforts to collect, process and analyse population statistics prioritize data quality, including methods to address sampling bias, ensure confidentiality and privacy and reduce errors in data collection and reporting.

3. The Census did well at broadening participation and keeping people's information private. It reached out to different groups and offered several ways to respond. However, some people may still have been left out, since not everyone can participate online, and some respondents may have felt uncomfortable because of cultural or language barriers. To improve future surveys, organizers could communicate more with communities, make better use of technology, and act on feedback. These changes would help ensure that everyone's voice is heard and that more useful information is collected.

4. The 2020 Census collected data on sex by asking respondents whether they are male or female. It offered only those two options and did not ask separately about gender identity. To be more inclusive, it could offer a wider range of options and provide clearer guidance on the difference between sex and gender. This would help ensure that everyone's gender identity is accurately represented and respected in future surveys.

5. Protected characteristics such as sex, gender, sexual identity, or race raise concerns during data cleaning because they are sensitive and can be used to discriminate. Missing values can lead to biased analyses or inaccurate representations of populations, especially when some groups are more likely than others to have missing data. A good practice is to use careful, documented methods to handle missing values, while bad practices include silently ignoring missing values or filling them in based on stereotypes, which can perpetuate bias. Careful handling of protected characteristics is essential to ensure fairness and accuracy in data analysis.

6. If someone invented an algorithm that could impute values for protected characteristics such as race, gender, sex, or sexual orientation, there would be concerns about bias, privacy, and ethics. Assigning these characteristics without an individual's explicit consent may violate privacy rights and lead to discriminatory outcomes. Transparency and accountability are critical, as opaque decision-making processes can produce unfair or harmful attributions. In addition, inaccurate imputations can lead to faulty analysis and decisions that may stigmatize individuals or groups. Complying with legal and regulatory standards, while recognizing the nuances of protected characteristics, is essential for the responsible development and use of such algorithms.
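
The concern in answer 5 about missing values biasing an analysis can be illustrated with a small toy example. The data below are invented: group B is made more likely to have a missing label, and dropping the missing rows then understates B's share of the population.

```python
import numpy as np
import pandas as pd

# Invented toy data: 6 people in group A and 4 in group B.
group = pd.Series(["A"] * 6 + ["B"] * 4)
true_share_b = (group == "B").mean()

# Suppose half of group B's labels are missing (missing not at random).
observed = group.copy()
observed.iloc[[6, 7]] = np.nan

# Dropping the missing rows shrinks B's apparent share.
share_b_after_drop = (observed.dropna() == "B").mean()

print("True share of B:", true_share_b)
print("Share of B after dropping missing:", share_b_after_drop)
```

In this sketch the true share of group B is 0.4, but the share computed after dropping missing rows is 0.25, which is the kind of distortion that can carry through to downstream decisions.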