# Checkpoint Two: Exploratory Data Analysis

Now that your chosen dataset is approved, it is time to start working on your analysis. Use this notebook to perform your EDA and make notes where directed to as you work.

## Getting Started

Since we have not provided your dataset for you, you will need to load the necessary files in this repository. Make sure to include a link back to the original dataset here as well.

My dataset:

Your first task in EDA is to import necessary libraries and create a dataframe(s). Make note in the form of code comments of what your thought process is as you work on this setup task.

In [28]:
# Import pandas to load and explore the data, plus numpy, seaborn, and matplotlib to help with data visualizations.
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')
plt.rcParams['figure.figsize'] = (20,10)

# Create a dataframe with pandas. The mushroom data is a single csv file, so one dataframe is enough.
shrooms = pd.read_csv("mushrooms.csv")
shrooms

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


## Get to Know the Numbers

Now that you have everything set up, put any code that you use to get to know the dataframe and its rows and columns better in the cell below. You can use whatever techniques you like, except for visualizations. You will put those in a separate section.

When working on your code, make sure to leave comments so that your mentors can understand your thought process.

In [18]:
# info() gives an overview of the dataset: 8124 rows with no null values, every column stored as dtype object, and the full names of all 23 columns.
shrooms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [17]:
# nunique() shows how many distinct values each column has. class has 2 (e = edible, p = poisonous), and so on. gill-color has the most variety with 12 distinct values.
shrooms.nunique()

class                        2
cap-shape                    6
cap-surface                  4
cap-color                   10
bruises                      2
odor                         9
gill-attachment              2
gill-spacing                 2
gill-size                    2
gill-color                  12
stalk-shape                  2
stalk-root                   5
stalk-surface-above-ring     4
stalk-surface-below-ring     4
stalk-color-above-ring       9
stalk-color-below-ring       9
veil-type                    1
veil-color                   4
ring-number                  3
ring-type                    5
spore-print-color            9
population                   6
habitat                      7
dtype: int64
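
The `nunique()` output shows `veil-type` with a single distinct value, so that column cannot help tell edible mushrooms from poisonous ones. Here is a minimal sketch of how such constant columns could be flagged. The small frame `sample` is a made-up stand-in for `shrooms`, and its values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for shrooms; values are made up for illustration.
sample = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-type": ["p", "p", "p", "p"],
    "habitat": ["u", "g", "m", "u"],
})

# Count distinct values per column, then keep the columns with exactly one.
unique_counts = sample.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)
```

On the real dataframe the same two lines would run against `shrooms` instead of `sample`. Whether to drop such a column is a decision for the cleaning phase.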

In [19]:
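# describe() shows a statistical summary. Because every column is text, it reports count, unique, top (most common value), and freq (how often top occurs).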
shrooms.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [20]:
# head() shows the first 5 rows. shrooms.tail() can be used to show the last 5 rows.
shrooms.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [21]:
# value_counts() shows how many mushrooms share each gill color. b (buff), for example, is the most frequent gill color in the dataset, with 1728 mushrooms.
# The least common color is r (green), with 24. The column name can be changed to show value counts for any other column.
gill_counts = shrooms["gill-color"].value_counts()
gill_counts

b    1728
p    1492
w    1202
n    1048
g     752
h     732
u     492
k     408
e      96
y      86
o      64
r      24
Name: gill-color, dtype: int64
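
Raw counts are easier to compare as proportions, and the letter codes are easier to read as names. The sketch below does both on a made-up sample series. The `gill_names` mapping is my reading of the dataset's attribute description and should be treated as an assumption until checked against the source.

```python
import pandas as pd

# Hypothetical sample of gill-color codes; the real column is shrooms["gill-color"].
gill = pd.Series(["b", "b", "p", "w", "b", "r"], name="gill-color")

# Assumed code-to-name mapping; verify against the dataset's documentation.
gill_names = {
    "k": "black", "n": "brown", "b": "buff", "h": "chocolate",
    "g": "gray", "r": "green", "o": "orange", "p": "pink",
    "u": "purple", "e": "red", "w": "white", "y": "yellow",
}

# map() swaps each code for its name; normalize=True returns shares instead of counts.
shares = gill.map(gill_names).value_counts(normalize=True)
print(shares.round(2))
```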

In [22]:
# value_counts() shows how many mushrooms were found in each habitat.
habitat_counts = shrooms["habitat"].value_counts()
habitat_counts

d    3148
g    2148
p    1144
l     832
u     368
m     292
w     192
Name: habitat, dtype: int64
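
`value_counts()` looks at one column at a time. To see how edible and poisonous mushrooms are split across habitats, two columns can be counted together with `pd.crosstab`. This is a sketch on a made-up frame (`sample`), not the real data.

```python
import pandas as pd

# Hypothetical stand-in for shrooms with only the two columns of interest.
sample = pd.DataFrame({
    "class":   ["p", "e", "e", "p", "e", "p"],
    "habitat": ["u", "g", "g", "u", "d", "d"],
})

# Rows are habitats, columns are classes, cells are row counts.
counts = pd.crosstab(sample["habitat"], sample["class"])

# normalize="index" turns each row into shares that sum to 1,
# i.e. the fraction of edible vs. poisonous within each habitat.
shares = pd.crosstab(sample["habitat"], sample["class"], normalize="index")

print(counts)
print(shares)
```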

## Visualize

Create any visualizations for your EDA here. Make note in the form of code comments of what your thought process is for your visualizations.

In [45]:
# Vertical bar chart of poisonous mushrooms by habitat.
# To distinguish the class (edible vs. poisonous), filter on the class column in the row selector of .loc:
# keep rows where class == 'p' (poisonous), then count how many fall in each habitat. Use 'e' for edible.
style.use('ggplot')
shrooms_habitat = shrooms.loc[shrooms['class'] == 'p', 'habitat'].value_counts().to_frame('total')

shrooms_habitat.plot(kind = 'bar', legend = False)
plt.title('Poisonous Mushrooms Found by Habitat',color = 'black')
plt.xticks(color = 'black')
plt.yticks(color = 'black')
plt.xlabel('Habitat',color = 'black')
plt.ylabel('Poisonous Mushrooms',color = 'black')
plt.savefig('bar_vertical.png')

plt.show()

In [50]:
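# Grouped bar chart: one group of bars per class, one bar per habitat.
# The data stores single-letter codes, not words like 'edible' or 'grasses', so map the codes to names first.
# Note that woods is coded as d. Check the mapping against the dataset's documentation.
habitat_names = {'g':'grasses','l':'leaves','m':'meadows','p':'paths','u':'urban','w':'waste','d':'woods'}
class_names = {'e':'edible','p':'poisonous'}
# crosstab counts the rows for every class/habitat pair: classes become rows, habitats become columns.
shrooms_group = pd.crosstab(shrooms['class'].map(class_names), shrooms['habitat'].map(habitat_names))

shrooms_group.plot.bar(edgecolor = 'white')
plt.title('Mushrooms Found by Habitat and Class',color = 'black')
plt.xticks(color = 'black')
plt.yticks(color = 'black')
plt.xlabel('Class',color = 'black')
plt.ylabel('Number of Mushrooms',color = 'black')
plt.legend(title = 'Habitats', fontsize = 12)
plt.savefig('bar_grouped.png')

plt.show()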

## Summarize Your Results

With your EDA complete, answer the following questions.

1. Was there anything surprising about your dataset? 
2. Do you have any concerns about your dataset? 
3. Is there anything you want to make note of for the next phase of your analysis, which is cleaning data? 

In [None]:
# I am having difficulty with the visualizations for this assignment :(
# 1. I didn't find anything surprising about this dataset. However, there are almost as many poisonous mushrooms (3916) as edible ones (4208), and I think that is interesting.
# 2. The main concern I have about the dataset is that the values are stored as single-letter codes (aliases), which are hard to read without a key.
# 3. I need to ask for some extra help so that I can get the visualizations down and confirm my business issue.
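
One thing that may be worth checking before the cleaning phase: `info()` reports no nulls, but a dataset stored as single-letter codes can still mark missing values with a placeholder character. I believe this dataset uses `?` for unknown `stalk-root` values, but that is an assumption to verify against the real file. A sketch on a made-up frame (`sample`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for shrooms; the "?" entries are illustrative.
sample = pd.DataFrame({
    "stalk-root": ["e", "c", "?", "b", "?"],
    "habitat": ["u", "g", "m", "u", "g"],
})

# Count how often the placeholder appears in each column.
placeholder_counts = (sample == "?").sum()

# Convert placeholders to real missing values so pandas can report them.
cleaned = sample.replace("?", np.nan)

print(placeholder_counts)
print(cleaned.isna().sum())
```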