# Data Wrangling

1. Data Types and Structures: Understanding the different types of data and their properties is critical to performing effective data wrangling. This includes understanding the differences between numeric, categorical, and ordinal data, as well as the different data structures such as arrays, lists, tuples, and data frames. Some important concepts to study include:

Data types and structures in Python, including NumPy arrays and pandas data frames.

Data manipulation techniques, such as subsetting, indexing, and slicing data frames.

Dealing with missing or null values, including imputation techniques.

The concept of data normalization, and the importance of scaling data for some machine learning algorithms.
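The last two items, imputation and scaling, can be sketched on a toy data frame. This is a minimal illustration, not a recommended pipeline: the column names and values are made up, mean imputation is only one of several imputation strategies, and min-max scaling is only one of several normalization schemes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, np.nan, 180.0],
    "weight_kg": [50.0, np.nan, 70.0, 90.0],
})

# Count the missing values in each column.
missing_counts = df.isna().sum()

# Mean imputation: replace each NaN with the mean of its own column.
imputed = df.fillna(df.mean())

# Min-max normalization: rescale every column to the range [0, 1].
scaled = (imputed - imputed.min()) / (imputed.max() - imputed.min())

print(missing_counts)
print(imputed)
print(scaled)
```

Scaling matters mostly for algorithms that compare distances or magnitudes across columns; without it, the column with the largest numeric range tends to dominate.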


In [26]:
import pandas as pd
import numpy as np
import random 

In [27]:
x = [1,2,3,4,5,6,7,8,9,10]
x = np.array(x)
x

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [28]:
#Indexing into arrays print(array[position])
print(x[0])
#Should print 1

print()
print(x[5])
#Should print 6

print()
print(x[-1])
#Should print 10
# -1 is the easiest way to index to the last element without knowing
#How many elements are in the array 

print()
print(x[-5])
#Should print 6 again

1

6

10

6


In [29]:
#Slicing print(array[start:end])

#Should print 3 4
print(x[2:4])

print()

#Should print an empty array [] because the slice starts at the last element (index 9) and the end index 3 comes before it

print(x[-1:3])

print()

#Should print 4-9
print(x[3:-1])

[3 4]

[]

[4 5 6 7 8 9]
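Slices also accept an optional third value, a step, in the form `array[start:end:step]`. The sketch below rebuilds the same `x` so that it runs on its own; the variable names are chosen only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

every_other = x[::2]   # from start to end, taking every second element
last_three = x[-3:]    # the final three elements
reversed_x = x[::-1]   # a negative step walks the array backwards

print(every_other)
print(last_three)
print(reversed_x)
```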


In [30]:
#Boolean indexing
# Greater than > or equal to >=
# Less than < or equal to <=
# Equal to ==
# Not equal to !=

#Should print any elements in x greater than 5
print(x[x>5])

print()

#Should print any elements in x less than 5
print(x[x<5])

print()

#Should print the elements less than 5, followed by the elements greater than 7 (two separate arrays)
print(x[x<5], x[x>7])


[ 6  7  8  9 10]

[1 2 3 4]

[1 2 3 4] [ 8  9 10]
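Two conditions can also be combined into a single mask, which returns one array instead of two. A small sketch, assuming the same `x` as above: NumPy uses `|` for element-wise "or" and `&` for element-wise "and", and each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Element-wise OR: keep values below 5 or above 7 in a single array.
outside = x[(x < 5) | (x > 7)]

# Element-wise AND: keep values strictly between 3 and 8.
between = x[(x > 3) & (x < 8)]

print(outside)
print(between)
```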


In [31]:
# Create a pandas data frame by first building a dictionary with strings as the keys
# (the column names) and lists as the values (the column data).

In [36]:
#DO NOT RE-RUN: no random seed is set, so every run generates different values
mean = 100000
std = 5000
mean1 = .60
std1 = .05
countries = {"Country": ["Puerto-Rico", "Costa-Rica", "Cape-Verde", "Jamaica", "Haiti", 
                         "Dominican-Republic", "Bahamas","DROC","Nigeria","Kenya", "Somalia"],
             "Population": [random.gauss(mean, std) for i in range(11)],
             "Christianity":[random.gauss(mean1, std1) for i in range(11)],
             "Islam": [random.gauss(mean1 - .25, std1 - 0.003) for i in range(11)],
             "Judaism": [random.gauss(mean1 - .05, std1 - 0.009) for i in range(11)],
             "New-Age-Spirtuality": [random.gauss(.15, std1 - 0.00283) for i in range(11)],
             "Athiest": [random.gauss(.005, std1 + 0.000183) for i in range(11)],
             "Republican":[random.gauss(.30, std1) for i in range(11)],
             "Democrat":[random.gauss(mean1, std1) for i in range(11)],
             "High-School-Diploma": [random.gauss(.80, 0.07) for i in range(11)],
             "Bachelors-Degree": [random.gauss(.70, 0.09) for i in range(11)],
             "Masters-Degree": [random.gauss(.20, 0.1) for i in range(11)],
            
    
}
countries = pd.DataFrame(countries)
countries

| | Country | Population | Christianity | Islam | Judaism | New-Age-Spirtuality | Athiest | Republican | Democrat | High-School-Diploma | Bachelors-Degree | Masters-Degree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Puerto-Rico | 100127.452374 | 0.633353 | 0.327497 | 0.517109 | 0.239192 | 0.086903 | 0.304668 | 0.550039 | 0.832857 | 0.622138 | 0.059931 |
| 1 | Costa-Rica | 96153.97167 | 0.542686 | 0.385019 | 0.494381 | 0.1398 | 0.002624 | 0.170932 | 0.672915 | 0.845915 | 0.637808 | 0.102362 |
| 2 | Cape-Verde | 101424.044704 | 0.535925 | 0.363955 | 0.576849 | 0.102801 | -0.008003 | 0.328306 | 0.516761 | 0.931418 | 0.905551 | 0.227774 |
| 3 | Jamaica | 100374.404071 | 0.612301 | 0.385329 | 0.485926 | 0.101321 | 0.03641 | 0.28681 | 0.719107 | 0.876761 | 0.516774 | 0.136936 |
| 4 | Haiti | 97130.721326 | 0.570261 | 0.386816 | 0.527725 | 0.212815 | 0.011233 | 0.234448 | 0.655362 | 0.790578 | 0.828495 | 0.084803 |
| 5 | Dominican-Republic | 96629.970071 | 0.655132 | 0.358988 | 0.595459 | 0.095969 | -0.017774 | 0.263719 | 0.581337 | 0.837553 | 0.752743 | 0.021103 |
| 6 | Bahamas | 93293.973128 | 0.692577 | 0.367533 | 0.516618 | 0.133042 | -0.019726 | 0.265492 | 0.592526 | 0.835589 | 0.722069 | 0.226927 |
| 7 | DROC | 95863.598368 | 0.524285 | 0.384296 | 0.659793 | 0.097799 | -0.041794 | 0.221233 | 0.661141 | 0.838209 | 0.715435 | 0.155698 |
| 8 | Nigeria | 99120.232436 | 0.621453 | 0.401333 | 0.525858 | 0.084016 | 0.014652 | 0.313088 | 0.60948 | 0.774249 | 0.656542 | 0.217834 |
| 9 | Kenya | 97947.754982 | 0.640603 | 0.398482 | 0.536544 | 0.189703 | 0.076439 | 0.238025 | 0.572693 | 0.67348 | 0.573782 | 0.220559 |
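The warning against re-running the cell above exists because `random.gauss` draws new values on every run. One way to make such a cell safe to re-run is to draw from a seeded generator. The sketch below is an assumption about how that could look, not the notebook's own method; `sample_shares` is a hypothetical helper and the seed value 42 is arbitrary.

```python
import random


def sample_shares(mean, std, n, seed):
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return [rng.gauss(mean, std) for _ in range(n)]


# The same seed always yields the same list, so re-running changes nothing.
first = sample_shares(0.60, 0.05, 11, seed=42)
second = sample_shares(0.60, 0.05, 11, seed=42)

print(first == second)
```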


In [38]:
# Column indexing through data frames
# Format: print(dataframe['column'])
# Use tab completion to find column names, or list them all with countries.columns:
print(countries.columns)

print()
print(countries['Athiest'])

Index(['Country', 'Population', 'Christianity', 'Islam', 'Judaism',
       'New-Age-Spirtuality', 'Athiest', 'Republican', 'Democrat',
       'High-School-Diploma', 'Bachelors-Degree', 'Masters-Degree'],
      dtype='object')

0     0.086903
1     0.002624
2    -0.008003
3     0.036410
4     0.011233
5    -0.017774
6    -0.019726
7    -0.041794
8     0.014652
9     0.076439
10   -0.022994
Name: Athiest, dtype: float64
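Passing a list of names instead of a single name selects several columns at once. A minimal sketch on a three-row stand-in for `countries` (values copied from the table above, column spellings kept as in the notebook):

```python
import pandas as pd

# Small stand-in for the countries data frame.
countries = pd.DataFrame({
    "Country": ["Puerto-Rico", "Costa-Rica", "Cape-Verde"],
    "Athiest": [0.086903, 0.002624, -0.008003],
    "Bachelors-Degree": [0.622138, 0.637808, 0.905551],
})

one_column = countries["Athiest"]                # single brackets return a Series
two_columns = countries[["Country", "Athiest"]]  # a list of names returns a DataFrame

print(type(one_column).__name__)
print(two_columns.shape)
```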


In [39]:
# Row indexing through data frames
# Format: print(dataframe.loc[row])

# This will print Costa-Rica data
print(countries.loc[1])

Country                 Costa-Rica
Population             96153.97167
Christianity              0.542686
Islam                     0.385019
Judaism                   0.494381
New-Age-Spirtuality         0.1398
Athiest                   0.002624
Republican                0.170932
Democrat                  0.672915
High-School-Diploma       0.845915
Bachelors-Degree          0.637808
Masters-Degree            0.102362
Name: 1, dtype: object
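`.loc` selects by index label, while `.iloc` selects by integer position. With the default 0, 1, 2, ... index the two often agree, but they differ for negative positions and for slices. A short sketch on a stand-in frame (values copied from the table above):

```python
import pandas as pd

countries = pd.DataFrame({
    "Country": ["Puerto-Rico", "Costa-Rica", "Cape-Verde"],
    "Bachelors-Degree": [0.622138, 0.637808, 0.905551],
})

by_label = countries.loc[1]        # the row whose index label is 1
by_position = countries.iloc[-1]   # the last row, whatever its label is

# .loc also takes a column label, which pulls out a single cell.
cell = countries.loc[1, "Country"]

# Label slices include the end label; position slices exclude the end position.
loc_slice = countries.loc[0:1]     # rows labelled 0 and 1
iloc_slice = countries.iloc[0:1]   # row at position 0 only

print(cell)
print(len(loc_slice), len(iloc_slice))
```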


In [40]:
# Boolean Indexing
# Greater than > or equal to >=
# Less than < or equal to <=
# Equal to ==
# Not equal to !=
# Format: print(dataframe['desired_column'] > value)

# Goal: find the countries where more than 60% of people hold a Bachelors degree.
# The comparison returns a True/False mask with one entry per row.

#First let's check the Bachelors-Degree column
print(countries['Bachelors-Degree'])

print()

print(countries['Bachelors-Degree'] > .60)


0     0.622138
1     0.637808
2     0.905551
3     0.516774
4     0.828495
5     0.752743
6     0.722069
7     0.715435
8     0.656542
9     0.573782
10    0.696225
Name: Bachelors-Degree, dtype: float64

0      True
1      True
2      True
3     False
4      True
5      True
6      True
7      True
8      True
9     False
10     True
Name: Bachelors-Degree, dtype: bool
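The True/False mask becomes useful once it is passed back into the data frame, which keeps only the rows marked `True`. A minimal sketch on a four-row stand-in for `countries` (values copied from the table above; the 0.08 threshold for the second condition is arbitrary):

```python
import pandas as pd

countries = pd.DataFrame({
    "Country": ["Puerto-Rico", "Jamaica", "Haiti", "Kenya"],
    "Bachelors-Degree": [0.622138, 0.516774, 0.828495, 0.573782],
    "Masters-Degree": [0.059931, 0.136936, 0.084803, 0.220559],
})

# Build the mask, then use it to keep only the matching rows.
mask = countries["Bachelors-Degree"] > 0.60
over_60 = countries[mask]

# Combine conditions with & (and) or | (or), wrapping each one in parentheses.
both = countries[
    (countries["Bachelors-Degree"] > 0.60) & (countries["Masters-Degree"] > 0.08)
]

print(over_60["Country"].tolist())
print(both["Country"].tolist())
```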


2. Data Cleaning and Preprocessing: Raw data is rarely ready for analysis and often needs to be cleaned and preprocessed to remove errors, inconsistencies, or irrelevant data. Effective data cleaning and preprocessing can significantly improve the accuracy and quality of your data. Some important concepts to study include:

Identifying and handling outliers and anomalies in data.

Dealing with duplicates or redundant data.

Handling inconsistent data, such as data with spelling errors, abbreviations, or different formats.

Standardizing data to a common format or scale.

Applying data transformations, such as log transformations or power transformations.
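The concepts listed above can be sketched in one short pass over a made-up table. Everything here is illustrative: the `raw` data, the column names, and the choice of the 1.5 × IQR rule for outliers are assumptions, and real data usually needs rules tailored to the domain.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with inconsistent spelling, duplicates, and one extreme value.
raw = pd.DataFrame({
    "country": ["Kenya", "kenya ", "Nigeria", "Nigeria", "Haiti", "Jamaica", "Bahamas"],
    "income": [1200.0, 1200.0, 1500.0, 1500.0, 900.0, 1100.0, 50000.0],
})

# Inconsistent formats: trim whitespace and unify capitalization.
clean = raw.copy()
clean["country"] = clean["country"].str.strip().str.title()

# Duplicates: drop rows that are now identical, keeping the first of each.
clean = clean.drop_duplicates().reset_index(drop=True)

# Outliers: flag values more than 1.5 * IQR beyond the quartiles (a common rule of thumb).
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["is_outlier"] = (clean["income"] < q1 - 1.5 * iqr) | (clean["income"] > q3 + 1.5 * iqr)

# Log transformation: compress the long right tail (log1p computes log(1 + x)).
clean["log_income"] = np.log1p(clean["income"])

# Standardization: rescale to zero mean and unit standard deviation (z-score).
clean["income_z"] = (clean["income"] - clean["income"].mean()) / clean["income"].std()

print(clean)
```

Flagging outliers in a separate column, rather than deleting them, keeps the decision about what to do with them visible and reversible.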

3. Data Transformation: Transforming data converts it from one format or structure to another so that it is ready for further analysis or for building machine learning models. Some important concepts to study include:

Aggregating data by grouping and summarizing data.

Reshaping data using techniques like pivot tables or melt.

Combining data from multiple sources, including joining or merging data sets.

Creating new variables or features based on existing data.
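Each of these four operations maps onto a pandas call. The sketch below uses two made-up tables, `sales` and `managers`, whose names and values are purely illustrative.

```python
import pandas as pd

# Hypothetical tables for illustration.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "year": [2022, 2023, 2022, 2023],
    "revenue": [100.0, 120.0, 80.0, 100.0],
    "cost": [60.0, 70.0, 50.0, 65.0],
})
managers = pd.DataFrame({
    "region": ["North", "South"],
    "manager": ["Ada", "Grace"],
})

# New feature: derive profit from two existing columns.
sales["profit"] = sales["revenue"] - sales["cost"]

# Aggregation: total profit per region.
totals = sales.groupby("region")["profit"].sum()

# Reshaping, long to wide: one row per region, one column per year.
wide = sales.pivot_table(index="region", columns="year", values="revenue", aggfunc="sum")

# Reshaping, wide back to long: one row per (region, year) pair.
long = wide.reset_index().melt(id_vars="region", var_name="year", value_name="revenue")

# Combining sources: attach each region's manager with a left join.
merged = sales.merge(managers, on="region", how="left")

print(totals)
print(wide)
print(long)
print(merged)
```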

4. Data Visualization: Data visualization is a key component of data wrangling because it helps you identify patterns and relationships in your data that are not immediately apparent. Some important concepts to study include:

Creating effective visualizations using libraries like Matplotlib and Seaborn.

Choosing the appropriate chart type for your data, such as bar charts, line charts, or scatter plots.

Identifying and addressing issues with visualization, such as visual clutter, misleading scales, or overplotting.

Using visualizations to explore and communicate data, such as building dashboards or creating interactive visualizations.
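As a small illustration of choosing a chart type, the sketch below draws a bar chart for comparing one value across categories and a scatter plot for the relationship between two numeric columns. It assumes Matplotlib is installed and reuses the first five rows of the table above. The `Agg` backend line is only there so the script runs without a display; inside a notebook it can be dropped.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# First five rows of the countries table above.
countries = ["Puerto-Rico", "Costa-Rica", "Cape-Verde", "Jamaica", "Haiti"]
bachelors = [0.622138, 0.637808, 0.905551, 0.516774, 0.828495]
masters = [0.059931, 0.102362, 0.227774, 0.136936, 0.084803]

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: compare one quantity across categories.
bars = ax_bar.bar(countries, bachelors)
ax_bar.set_ylim(0, 1)  # start the axis at zero to avoid a misleading scale
ax_bar.set_ylabel("Share with Bachelors degree")
ax_bar.set_title("Bachelors degree by country")
ax_bar.tick_params(axis="x", rotation=45)  # rotate labels to reduce clutter

# Scatter plot: look for a relationship between two numeric columns.
ax_scatter.scatter(bachelors, masters)
ax_scatter.set_xlabel("Share with Bachelors degree")
ax_scatter.set_ylabel("Share with Masters degree")
ax_scatter.set_title("Bachelors vs. Masters")

fig.tight_layout()

# Render to an in-memory PNG; use fig.savefig("plot.png") to write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```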