# Project 1

Authors: Ryan Rosiak [rrosiak1@gulls.salisbury.edu] and Grant Dawson [gdawson1@gulls.salisbury.edu]

Date: 10/15/20

Description: Continuing to work with the adult data set mixed with probability

In [1]:
import pandas as pd # Pandas library
import numpy as np # Numpy library
import matplotlib.pyplot as plt # Matplotlib library

# Dataset 1: The Grocery Dataset

In [2]:
grocery_data = pd.read_csv('./GroceryDataset/Groceries_dataset.csv',
                          header=1,
                          skipinitialspace=False,
                          names=['Member_number', 'Date', 'itemDescription'])
grocery_data.head(100)

Unnamed: 0,Member_number,Date,itemDescription
0,2552,05-01-2015,whole milk
1,2300,19-09-2015,pip fruit
2,1187,12-12-2015,other vegetables
3,3037,01-02-2015,whole milk
4,4941,14-02-2015,rolls/buns
5,4501,08-05-2015,other vegetables
6,3803,23-12-2015,pot plants
7,2762,20-03-2015,whole milk
8,4119,12-02-2015,tropical fruit
9,1340,24-02-2015,citrus fruit
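
One caveat about the `read_csv` call above, offered as a small sketch rather than a definitive diagnosis: when `header=1` is combined with `names=`, pandas treats the second line of the file as the header row to replace, so the first record is discarded along with the real header. The same `header=1` pattern is used for every file loaded in this notebook. The `csv_text` below is a hypothetical stand-in for the file (its header line is an assumption), with records copied from the output above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: one header line plus three records.
# The header text is an assumption; the records come from the output above.
csv_text = (
    "Member_number,Date,itemDescription\n"
    "2552,05-01-2015,whole milk\n"
    "2300,19-09-2015,pip fruit\n"
    "1187,12-12-2015,other vegetables\n"
)
names = ['Member_number', 'Date', 'itemDescription']

# header=1: the SECOND line is taken as the header to replace,
# so the first record (member 2552) is dropped.
with_header_1 = pd.read_csv(io.StringIO(csv_text), header=1, names=names)

# header=0: only the real header line is replaced and every record is kept.
with_header_0 = pd.read_csv(io.StringIO(csv_text), header=0, names=names)

print(len(with_header_1), len(with_header_0))
```

If every record should be kept, `header=0` is the safer choice when supplying custom `names`.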


# What type of population is being sampled? What are the “things” getting measured – usually one per row of data.

The population is 38765 purchase orders placed by members of various grocery stores around the U.S. Each row measures one order: who placed it, when it was placed, and the most popular item in it. With this data we can look at how purchases of certain items change throughout each year, as well as which items are bought more often than others.

# What features does each sample have, i.e. what is being measured?

Each sample has three features. The first is Member_number, the id associated with the receipt of each order placed by a member of a grocery store. The second is Date, the date of each purchase order. The third is itemDescription, the name of the most popular item (the one bought in the largest amount) in each recorded purchase order.

# Are the features quantitative or qualitative? Ordinal or nominal? Continuous or discrete?

The features are a mix of quantitative and qualitative data. Member_number and Date are recorded as numeric data, while itemDescription is qualitative because it is non-numeric and provides a classification rather than a number. itemDescription is nominal because its categories have no inherent order. Date is interval data: it lies on a scale where differences are meaningful, but there is no absolute zero. Member_number is numeric, but its values do not hold any real weight. One number is not "greater" than another; it is simply a way of recording which member has which id, so it behaves as a nominal label rather than a true ordinal scale. itemDescription is discrete because it takes a finite set of categories. Member_number is discrete because it consists of whole numbers that act as labels rather than points on a continuous number system. Lastly, Date is continuous in principle because time is continuous. It can be considered discrete here, since each value is recorded to the day rather than to the exact second, but overall, recorded time is continuous data.
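
One possible way to mirror this classification in pandas is through column dtypes. The sketch below is an illustration under assumptions, not part of the original analysis: `sample` is a hypothetical stand-in for `grocery_data` built from the first rows shown earlier, and treating Member_number as a categorical label is one choice among several.

```python
import pandas as pd

# Hypothetical stand-in for grocery_data, using rows from the output above.
sample = pd.DataFrame({
    "Member_number": [2552, 2300, 1187],
    "Date": ["05-01-2015", "19-09-2015", "12-12-2015"],
    "itemDescription": ["whole milk", "pip fruit", "other vegetables"],
})

typed = sample.assign(
    # An id is a label, so a categorical dtype avoids accidental arithmetic on it.
    Member_number=sample["Member_number"].astype("category"),
    # The dates are written day-first (DD-MM-YYYY).
    Date=pd.to_datetime(sample["Date"], format="%d-%m-%Y"),
    # Item names are nominal categories.
    itemDescription=sample["itemDescription"].astype("category"),
)

print(typed.dtypes)
```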

# Is the data “complete” or do some of the samples have null or absent values for certain features? Why are these samples still useful? Why are they incomplete?

The data is complete. There are no null values because completing a purchase requires buying at least one item, which fills itemDescription; the purchase happens at some point in time, which fills Date; and every order has a Member_number associated with it.
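
The completeness claim can be checked directly rather than argued. A minimal sketch, assuming a hypothetical `sample` frame that stands in for `grocery_data` (rows copied from the output above); on the real data the same two calls would be applied to `grocery_data`.

```python
import pandas as pd

# Hypothetical stand-in for grocery_data, using rows from the output above.
sample = pd.DataFrame({
    "Member_number": [2552, 2300, 1187, 3037, 4941],
    "Date": ["05-01-2015", "19-09-2015", "12-12-2015", "01-02-2015", "14-02-2015"],
    "itemDescription": ["whole milk", "pip fruit", "other vegetables",
                        "whole milk", "rolls/buns"],
})

# Missing values per column; all zeros means no nulls in that column.
missing_per_column = sample.isna().sum()

# True only when no cell anywhere in the frame is null.
is_complete = not sample.isna().any().any()

print(missing_per_column)
print("complete:", is_complete)
```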

# Why are these features chosen to be part of the dataset?

These features are chosen to be a part of the dataset because we want to look at purchase trends for specific items in people's grocery baskets. Member_number serves as metadata for each purchase. Date lets us build time trends, either for the amount of an item purchased at various times of the year or for differences in purchases between years. Lastly, the item description is needed because we need to know which items are purchased most frequently. This is the feature that allows us to examine trends and see differences from purchase to purchase.

# What are some other features that are not included but that you think might make sense to include for this dataset?

We believe one useful feature that is not included would be all of the other items purchased within each order. This could help us find trends from member to member and spot multiple items trending across purchases rather than just one. With only one item per order, some items are left out even where a trend clearly exists. Another useful feature would be the specific grocery store in the U.S. that each purchase order comes from. This could be used to examine sales trends from store to store, as well as trends from region to region at various times of the year or from year to year.

# Give at least one way that you can pivot the dataset to get a slightly different representation of some values. Explain what this is and how you would use it for a visualization.

One way to pivot the dataset is to make itemDescription the index, Member_number the columns, and Date the values. This could give us a better representation of trends in certain items being purchased at various times of the year. We could use the pivoted data to get a running count of how many dates a specific item was the most-purchased item, and then look at when those dates fall to determine when an item is more likely to be purchased within the year. This could feed a line plot showing the purchase history of various items over a period of time. It would be useful for finding out when stores may run out of stock of certain items, and for getting ahead of the curve by buying an item before the time of year when it tends to be bought out in bulk.
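
A minimal sketch of this pivot, under stated assumptions: `sample` is a hypothetical stand-in for `grocery_data` built from the ten rows shown earlier, and `pivot_table` with a count aggregation is used in place of a plain `pivot` so that repeated (item, member) pairs in the full data would not raise an error.

```python
import pandas as pd

# Hypothetical stand-in for grocery_data, using the ten rows shown earlier.
sample = pd.DataFrame({
    "Member_number": [2552, 2300, 1187, 3037, 4941, 4501, 3803, 2762, 4119, 1340],
    "Date": ["05-01-2015", "19-09-2015", "12-12-2015", "01-02-2015", "14-02-2015",
             "08-05-2015", "23-12-2015", "20-03-2015", "12-02-2015", "24-02-2015"],
    "itemDescription": ["whole milk", "pip fruit", "other vegetables", "whole milk",
                        "rolls/buns", "other vegetables", "pot plants", "whole milk",
                        "tropical fruit", "citrus fruit"],
})

# Rows: items. Columns: members. Cells: number of dates on which that member
# has that item recorded (0 where the pair never occurs).
pivoted = sample.pivot_table(index="itemDescription", columns="Member_number",
                             values="Date", aggfunc="count", fill_value=0)

# Summing across members gives the running count of purchase dates per item.
purchases_per_item = pivoted.sum(axis=1).sort_values(ascending=False)

print(purchases_per_item)
```

For the line plot described above, the dates would additionally need to be parsed and grouped by month; the pivot here only covers the item-by-member counts.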

# Identify any possible relationships between features included in the data: which ones are likely to affect others?

The main relationship within this data set is between Date and itemDescription: the time of year affects which items people are most likely to buy. For example, certain fruits might be bought at certain points in the year because they are "in season," or chocolate might be bought more around February because of Valentine's Day. This is a plausible cause-and-effect relationship, and the table below begins to explore it for chocolate.

# Show at least one plot or visualization to illustrate this (possible) relationship.

In [3]:
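# Keep only the two columns involved in the relationship
sub_data = grocery_data[['Date', 'itemDescription']]
# Count chocolate purchases per date. Date is still a string here,
# so the groups sort as text (by day-of-month), not chronologically.
sub_data[sub_data['itemDescription'] == 'chocolate'].groupby(by='Date').count()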

Unnamed: 0_level_0,itemDescription
Date,Unnamed: 1_level_1
01-01-2014,1
01-01-2015,1
01-04-2015,1
01-05-2015,1
01-06-2015,1
01-07-2015,1
01-08-2014,1
01-08-2015,1
01-12-2014,1
02-03-2015,1
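
The table above is not yet a plot. One way to turn it into one, sketched under assumptions: the dates below are the ten chocolate purchase dates shown in the output (one purchase each), standing in for the full filtered column, and grouping by calendar month is our choice for exposing a seasonal pattern such as a February peak.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Chocolate purchase dates copied from the output above (one purchase each).
chocolate_dates = pd.Series([
    "01-01-2014", "01-01-2015", "01-04-2015", "01-05-2015", "01-06-2015",
    "01-07-2015", "01-08-2014", "01-08-2015", "01-12-2014", "02-03-2015",
])

# Parse the day-first strings into real dates.
parsed = pd.to_datetime(chocolate_dates, format="%d-%m-%Y")

# Purchases per calendar month (1 = January ... 12 = December),
# with months that never appear filled in as zero.
monthly_counts = parsed.dt.month.value_counts().reindex(range(1, 13), fill_value=0)

# Bar chart of the monthly counts; in a notebook the figure renders inline.
fig, ax = plt.subplots()
ax.bar(monthly_counts.index, monthly_counts.values)
ax.set_xlabel("Month")
ax.set_ylabel("Chocolate purchases")
ax.set_title("Chocolate purchases by month (sample rows only)")
```

With only ten rows the chart says little; on the full data the same steps would be applied to the `Date` column of the chocolate rows.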


# What numerical or statistical techniques might you consider using to determine whether the relationship is reliable?

# Are there external inferences you think might be possible? For instance, can you hypothesize a relationship with data not included in the dataset? Why or why not?

# What “extra” features can you perhaps compute from the data? For example, if you have data that includes product dates of purchase, you can “engineer” the data to construct the most popular products over various lengths of time (e.g. a particular holiday season). How might you use this information? Using the holiday example, you might try to correlate holiday sales of a product to some mainstream event that popularized it.

# Dataset 2: The San Francisco Crime Dataset

In [4]:
crime_data = pd.read_csv('./SanFranciscoCrimeDataset/crime.csv',
                          header=1,
                          skipinitialspace=False,
                          names=['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time',
                                'PdDistrict', 'Resolution', 'Address', 'X', 'Y', 'Location', 'PdId'])
crime_data.head(10)

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
0,120058272,WEAPON LAWS,"FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE",Friday,01/29/2016 12:00:00 AM,11:00,SOUTHERN,"ARREST, BOOKED",800 Block of BRYANT ST,-122.403405,37.775421,"(37.775420706711, -122.403404791479)",12005827212168
1,141059263,WARRANTS,WARRANT ARREST,Monday,04/25/2016 12:00:00 AM,14:59,BAYVIEW,"ARREST, BOOKED",KEITH ST / SHAFTER AV,-122.388856,37.729981,"(37.7299809672996, -122.388856204292)",14105926363010
2,160013662,NON-CRIMINAL,LOST PROPERTY,Tuesday,01/05/2016 12:00:00 AM,23:50,TENDERLOIN,NONE,JONES ST / OFARRELL ST,-122.412971,37.785788,"(37.7857883766888, -122.412970537591)",16001366271000
3,160002740,NON-CRIMINAL,LOST PROPERTY,Friday,01/01/2016 12:00:00 AM,00:30,MISSION,NONE,16TH ST / MISSION ST,-122.419672,37.76505,"(37.7650501214668, -122.419671780296)",16000274071000
4,160002869,ASSAULT,BATTERY,Friday,01/01/2016 12:00:00 AM,21:35,NORTHERN,NONE,1700 Block of BUSH ST,-122.426077,37.788019,"(37.788018555829, -122.426077177375)",16000286904134
5,160003130,OTHER OFFENSES,PAROLE VIOLATION,Saturday,01/02/2016 12:00:00 AM,00:04,SOUTHERN,"ARREST, BOOKED",MARY ST / HOWARD ST,-122.405721,37.780879,"(37.7808789360214, -122.405721454567)",16000313026150
6,160003259,NON-CRIMINAL,FIRE REPORT,Saturday,01/02/2016 12:00:00 AM,01:02,TENDERLOIN,NONE,200 Block of EDDY ST,-122.411778,37.783981,"(37.7839805592634, -122.411778295992)",16000325968000
7,160003970,WARRANTS,WARRANT ARREST,Saturday,01/02/2016 12:00:00 AM,12:21,SOUTHERN,"ARREST, BOOKED",4TH ST / BERRY ST,-122.393357,37.775788,"(37.7757876218293, -122.393357241451)",16000397063010
8,160003641,MISSING PERSON,FOUND PERSON,Friday,01/01/2016 12:00:00 AM,10:06,BAYVIEW,NONE,100 Block of CAMERON WY,-122.387182,37.720967,"(37.7209669615499, -122.387181635995)",16000364175000
9,160086863,LARCENY/THEFT,ATTEMPTED THEFT FROM LOCKED VEHICLE,Friday,01/29/2016 12:00:00 AM,22:30,TARAVAL,NONE,1200 Block of 19TH AV,-122.477377,37.764478,"(37.7644781578695, -122.477376524003)",16008686306240


# What type of population is being sampled? What are the “things” getting measured – usually one per row of data.

The population is 150500 crimes/incidents reported in San Francisco. Each row is one incident and records in great detail where and when it happened, along with some detail about what it was. Because every incident has many columns, the data can be examined from many different angles.

# What features does each sample have, i.e. what is being measured?

The dataset has thirteen columns: IncidntNum, Category, Descript, DayOfWeek, Date, Time, PdDistrict, Resolution, Address, X, Y, Location, and PdId.

Grouping related columns together, there are eight features to each sample.
#### Feature:
1. Incident Number - The incident number associated with the incident - This number is separate from the police's ID number but still identifies the incident
2. Category - The type of incident - One to a few words that best describe the incident that occurred in the row
3. Descript - Description of the incident - This gives more detail than the category about the incident
4. Day Of Week/Date/Time - Date and time when the incident occurred - These three columns give the exact time, date, and day of the week the row's incident occurred
5. PdDistrict - Police district where the incident occurred - This is the district the incident occurred in, as defined by the police and their routing
6. Resolution - End result - This is what happened to the party committing the crime/causing the incident
7. Address X/Y/Location - Address, longitude, and latitude - This is the exact geographic location of the incident
8. PdId - Police Department Identification Number - This is the ID number used in the police books
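
Since Day Of Week, Date, and Time describe the same moment, they can be combined into one timestamp and cross-checked. The following is a sketch under assumptions: `sample` is a hypothetical three-row stand-in for `crime_data` using values from the output above, and the format string is inferred from how the Date column is printed there.

```python
import pandas as pd

# Hypothetical stand-in for crime_data, using three rows from the output above.
sample = pd.DataFrame({
    "DayOfWeek": ["Friday", "Monday", "Tuesday"],
    "Date": ["01/29/2016 12:00:00 AM", "04/25/2016 12:00:00 AM",
             "01/05/2016 12:00:00 AM"],
    "Time": ["11:00", "14:59", "23:50"],
})

# The Date column always carries a midnight time, so it only contributes the day.
date_part = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")

# The Time column (HH:MM) becomes an offset from midnight.
time_part = pd.to_timedelta(sample["Time"] + ":00")

# One timestamp per incident.
sample["Timestamp"] = date_part + time_part

# Sanity check: the stored weekday should match the one derived from the date.
days_match = bool((sample["Timestamp"].dt.day_name() == sample["DayOfWeek"]).all())

print(sample[["Timestamp", "DayOfWeek"]])
print("weekdays consistent:", days_match)
```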



# Are the features quantitative or qualitative? Ordinal or nominal? Continuous or discrete?

#### Feature:
1. Incident Number - 
2. Category - 
3. Descript - 
4. Day Of Week/Date/Time - 
5. PdDistrict - 
6. Resolution - 
7. Address X/Y/Location - 
8. PdId - 

# Is the data “complete” or do some of the samples have null or absent values for certain features? Why are these samples still useful? Why are they incomplete?

The data is fully populated. The Resolution column may appear to have missing data where it says "NONE," but this is a legitimate value: it likely means the police were not able to charge anyone for the crime/incident committed. Even if we were to treat "NONE" as null, these rows are still valid and useful, because they may point to locations where offenders tend to evade the law.
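
The difference between a true null and the "NONE" label can be made concrete. A minimal sketch, assuming a hypothetical `resolution` series that holds the ten Resolution values from the output above in place of the full column:

```python
import pandas as pd

# Resolution values copied from the ten rows shown above.
resolution = pd.Series(
    ["ARREST, BOOKED", "ARREST, BOOKED", "NONE", "NONE", "NONE",
     "ARREST, BOOKED", "NONE", "ARREST, BOOKED", "NONE", "NONE"],
    name="Resolution",
)

# "NONE" is an ordinary string, so it is not counted as a missing value.
true_nulls = int(resolution.isna().sum())

# Counting the literal label shows how many incidents have no recorded resolution.
none_count = int((resolution == "NONE").sum())
none_share = none_count / len(resolution)

print(true_nulls, none_count, none_share)
```

On the full data, `crime_data['Resolution']` would take the place of `resolution`.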

# Why are these features chosen to be part of the dataset?

#### Feature:
1. Incident Number - 
2. Category - Shared among much of the population - Broad description of incidents - Allows searching for an incident
3. Descript - Less frequent among the population - Specific examples of incidents - Does the same as Category but with much more specific examples
4. Day Of Week/Date/Time - Tells us when incidents occurred
5. PdDistrict - Describes the location from the police side of things; typically the same group of police patrols the same parts of town.
6. Resolution - Tells us what happened to the offenders in the end, if anything.
7. Address X/Y/Location - Geographic location of the incident; tells us where it happened.
8. PdId - Allows for future look up in police records. 

# What are some other features that are not included but that you think might make sense to include for this dataset?

Other features that could have been added are the races of the perpetrators and victims. These would open up additional relationships and statistics, such as how recorded crime is distributed across races and how crime rates compare between them. It would also be interesting to know whether the police identified an incident as gang violence. This may not be known for every incident, but where it is known it would provide useful data about where gangs operate. Disputes could be happening at HQ locations or on the edges of territories, which would let us section the city on a map into gang "turfs."

# Give at least one way that you can pivot the dataset to get a slightly different representation of some values. Explain what this is and how you would use it for a visualization.

# Identify any possible relationships between features included in the data: which ones are likely to affect others?

Time and type of incident are likely related: we expect robberies and carjackings to typically happen at night rather than while the sun is up.
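
A first look at this relationship could be a cross-tabulation of category against time of day. The sketch below rests on assumptions: `sample` is a hypothetical stand-in for `crime_data` built from the ten rows shown earlier, and "night" is arbitrarily defined as 20:00 through 05:59, which is our cutoff rather than anything in the data.

```python
import pandas as pd

# Hypothetical stand-in for crime_data, using the ten rows shown earlier.
sample = pd.DataFrame({
    "Category": ["WEAPON LAWS", "WARRANTS", "NON-CRIMINAL", "NON-CRIMINAL",
                 "ASSAULT", "OTHER OFFENSES", "NON-CRIMINAL", "WARRANTS",
                 "MISSING PERSON", "LARCENY/THEFT"],
    "Time": ["11:00", "14:59", "23:50", "00:30", "21:35",
             "00:04", "01:02", "12:21", "10:06", "22:30"],
})

# Extract the hour from the HH:MM strings.
hours = sample["Time"].str.split(":").str[0].astype(int)

# Assumed cutoff: 20:00-05:59 counts as night, everything else as day.
sample["Period"] = hours.map(lambda h: "night" if (h >= 20 or h < 6) else "day")

# Incident counts per category, split by day/night.
table = pd.crosstab(sample["Category"], sample["Period"])

print(table)
```

Ten rows are far too few to support any conclusion; the same steps on the full data would show whether categories such as robbery actually skew toward night.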

# Show at least one plot or visualization to illustrate this (possible) relationship.

# What numerical or statistical techniques might you consider using to determine whether the relationship is reliable?

# Are there external inferences you think might be possible? For instance, can you hypothesize a relationship with data not included in the dataset? Why or why not?

# What “extra” features can you perhaps compute from the data? For example, if you have data that includes product dates of purchase, you can “engineer” the data to construct the most popular products over various lengths of time (e.g. a particular holiday season). How might you use this information? Using the holiday example, you might try to correlate holiday sales of a product to some mainstream event that popularized it.

# Dataset 3: The Wine Quality Dataset

In [5]:
red_wine_data = pd.read_csv('./WineQualityDataset/winequality-red.csv',
                          header=1,
                          skipinitialspace=False,
                          delimiter=';',
                          names=['fixed acidity', 'volatile acidity', 'citric acid', 
                                 'residual sugar', 'chlorides', 'free sulfur dioxide',
                                 'total sulfur dioxide', 'density', 'pH', 'sulphates',
                                 'alcohol', 'quality'])
red_wine_data.head(100)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.9968,3.20,0.68,9.8,5
1,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.9970,3.26,0.65,9.8,5
2,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.9980,3.16,0.58,9.8,6
3,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
4,7.4,0.660,0.00,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5
5,7.9,0.600,0.06,1.6,0.069,15.0,59.0,0.9964,3.30,0.46,9.4,5
6,7.3,0.650,0.00,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7
7,7.8,0.580,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7
8,7.5,0.500,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.80,10.5,5
9,6.7,0.580,0.08,1.8,0.097,15.0,65.0,0.9959,3.28,0.54,9.2,5


In [6]:
white_wine_data = pd.read_csv('./WineQualityDataset/winequality-white.csv',
                          header=1,
                          skipinitialspace=False,
                          delimiter=';',
                          names=['fixed acidity', 'volatile acidity', 'citric acid', 
                                 'residual sugar', 'chlorides', 'free sulfur dioxide',
                                 'total sulfur dioxide', 'density', 'pH', 'sulphates',
                                 'alcohol', 'quality'])
white_wine_data.head(100)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,6.3,0.300,0.34,1.60,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6
1,8.1,0.280,0.40,6.90,0.050,30.0,97.0,0.9951,3.26,0.44,10.1,6
2,7.2,0.230,0.32,8.50,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6
3,7.2,0.230,0.32,8.50,0.058,47.0,186.0,0.9956,3.19,0.40,9.9,6
4,8.1,0.280,0.40,6.90,0.050,30.0,97.0,0.9951,3.26,0.44,10.1,6
5,6.2,0.320,0.16,7.00,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6
6,7.0,0.270,0.36,20.70,0.045,45.0,170.0,1.0010,3.00,0.45,8.8,6
7,6.3,0.300,0.34,1.60,0.049,14.0,132.0,0.9940,3.30,0.49,9.5,6
8,8.1,0.220,0.43,1.50,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6
9,8.1,0.270,0.41,1.45,0.033,11.0,63.0,0.9908,2.99,0.56,12.0,5


# What type of population is being sampled? What are the “things” getting measured – usually one per row of data.

# What features does each sample have, i.e. what is being measured?

# Are the features quantitative or qualitative? Ordinal or nominal? Continuous or discrete?

# Is the data “complete” or do some of the samples have null or absent values for certain features? Why are these samples still useful? Why are they incomplete?

# Why are these features chosen to be part of the dataset?

# What are some other features that are not included but that you think might make sense to include for this dataset?

# Give at least one way that you can pivot the dataset to get a slightly different representation of some values. Explain what this is and how you would use it for a visualization.

# Identify any possible relationships between features included in the data: which ones are likely to affect others?

# Show at least one plot or visualization to illustrate this (possible) relationship.

# What numerical or statistical techniques might you consider using to determine whether the relationship is reliable?

# Are there external inferences you think might be possible? For instance, can you hypothesize a relationship with data not included in the dataset? Why or why not?

# What “extra” features can you perhaps compute from the data? For example, if you have data that includes product dates of purchase, you can “engineer” the data to construct the most popular products over various lengths of time (e.g. a particular holiday season). How might you use this information? Using the holiday example, you might try to correlate holiday sales of a product to some mainstream event that popularized it.