In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

In [None]:
#important libraries 
import numpy as np 
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt




In [None]:
titanic = pd.read_csv("../input/titanic/train.csv")
sf_salaries = pd.read_csv("../input/sf-salaries/Salaries.csv")

In [None]:
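# Only the last expression in a cell is rendered, so display both tables explicitly
display(titanic.head())
display(sf_salaries.head())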

## Motivation
This notebook is inspired by the course [Understanding and Visualizing Data](https://www.coursera.org/learn/understanding-visualization-data), which I have taken, and I wanted to share what I learned. Putting everything I know about data visualization into a single notebook would make it long and hard for beginners to follow, so I am splitting the material into four notebooks that build on each other.
The first part of the series is [Intro to Data (Descriptive Statistics Part-I)](https://www.kaggle.com/shuchita008/introduction-to-data-with-python-part-i).

## Objective 
In this notebook, I will show graphical and numerical summaries of univariate data. In particular, I will create and analyze **histograms**, **box plots**, and **numerical summaries** as a basis for analyzing quantitative data, and **bar charts** and **pie charts** for categorical data.


# Univariate Data
Univariate analysis is the simplest form of statistical analysis. It can be descriptive or inferential. Uni means “one”, so in this type of analysis the data contains only one variable. Univariate analysis does not look for relationships between variables. Its purpose is to summarize the data for that one variable and find the patterns that exist in it.

**Example:** Suppose you have **weight** readings for the children in a school, and you then add another variable, **age**, to the data. Univariate descriptive analysis will not check whether weight is related to age. It will simply summarize the weight data to find its pattern.

Ways to describe univariate data include measures of central tendency (mean, median and mode) and measures of dispersion (range, minimum and maximum, quartiles, variance and standard deviation).
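
As a minimal sketch of these measures, the snippet below summarizes a small, made-up list of children's weights. The values and the name `weights` are assumptions for illustration, not data from this notebook.

```python
import pandas as pd

# Hypothetical weights in kg, invented for illustration only
weights = pd.Series([30, 32, 32, 35, 36, 38, 41])

# Central tendency
mean_weight = weights.mean()
median_weight = weights.median()
mode_weight = weights.mode().iloc[0]  # mode() returns a Series; take the first mode

# Dispersion
weight_range = weights.max() - weights.min()
variance = weights.var()  # sample variance (pandas default, ddof=1)
std_dev = weights.std()   # sample standard deviation
q1, q3 = weights.quantile([0.25, 0.75])

print(f"mean={mean_weight:.2f}, median={median_weight}, mode={mode_weight}")
print(f"range={weight_range}, variance={variance:.2f}, std={std_dev:.2f}")
print(f"Q1={q1}, Q3={q3}")
```

Note that none of these numbers involve a second variable such as age; each one describes the weight column on its own.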


# Categorical Data: Frequency Table

The most common way to summarize [categorical data](https://www.kaggle.com/shuchita008/introduction-to-data-with-python-part-i#Types-of-Data.) is a frequency table, which records how many times each value occurs in the dataset. Here are the names and ticket classes of the first few passengers in the Titanic data.

In [None]:
titanic.loc[:5,["Name", "Pclass"]]

**Pclass:**
* 1 = 1st class
* 2 = 2nd class
* 3 = 3rd class.

Pclass is clearly categorical, because the number does not represent a numerical quantity; it only labels the passenger's class. Let's make a frequency table to see how passengers are divided among the classes on the Titanic.

## .crosstab()

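To build a frequency table from a DataFrame column we can use pandas' **crosstab()**.
>**pd.crosstab(index, columns)**


**index**   = values to group by in the rows (here, the `Pclass` column).

**columns** = values to group by in the columns. Passing a single label such as `'count'` produces one column that holds the frequency of each row value.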


In [None]:
fre_table = pd.crosstab(titanic["Pclass"], 'count')
fre_table.columns= ['count']
fre_table

Let's check the percentages.

In [None]:
number_of_passengers = len(titanic)
# Share of passengers in each class, as a percentage of all passengers
fre_table["Percentage"] = fre_table["count"] * 100 / number_of_passengers
fre_table

### One Sentence Summary: "About 24% of the passengers in the Titanic data were in 1st class, 21% in 2nd class and 55% in 3rd class."
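
A similar table can also be sketched with `value_counts()`, which counts each category directly and, with `normalize=True`, returns proportions instead of counts. The small `pclass` series below is made up for illustration; in this notebook you would use `titanic["Pclass"]` instead.

```python
import pandas as pd

# Hypothetical ticket classes for ten passengers (illustration only)
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 2, 3, 3], name="Pclass")

# Absolute frequency of each class, ordered by class number
counts = pclass.value_counts().sort_index()

# Relative frequency as a percentage of all passengers
percentages = (pclass.value_counts(normalize=True).sort_index() * 100).round(1)

freq_table = pd.DataFrame({"count": counts, "percentage": percentages})
print(freq_table)
```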

# Visualize Categorical Data: Bar Chart
The most common way to visualize categorical data is a bar chart. A bar chart displays data using a number of bars, each representing a particular category. The height of each bar is proportional to the value aggregated for that category, here the number of passengers.

In [None]:
import seaborn as sns # For plotting
from matplotlib import pyplot as plt


In [None]:
# Count passengers per ticket class and order the bars 1, 2, 3
class_counts = titanic["Pclass"].value_counts().sort_index()

class_counts.plot.bar(rot=0)

plt.xlabel("Ticket Class")
plt.ylabel("Number of people")

## Visualize Categorical Data: Pie Chart
A pie chart is another way to visualize categorical data, but it should be used with caution: when some categories make up only a small portion of the data, their labels can overlap. Here is an example of labels overlapping in a pie chart.
![image.png](attachment:image.png)

# Quantitative Data: Histogram
### What is a histogram?
A histogram is a graphical representation of numerical data. It is similar to a bar chart, but it deals with quantitative data. A histogram groups values into bins (intervals) and shows the number of items that fall in each bin. Bins can be wide or narrow, and the default bin width depends on the software being used.

![image.png](attachment:image.png)
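
To make the idea of bins concrete, here is a small sketch using `numpy.histogram`, which returns the count in each bin together with the bin edges. The `values` list and the choice of 3 bins are assumptions for illustration.

```python
import numpy as np

# Hypothetical measurements (illustration only)
values = [1, 2, 2, 3, 5, 6, 7, 8, 9, 10]

# Split the range of the data into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

Changing `bins` changes how wide each bar is, which is why the same data can look different from one tool to another.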

### Main Aspect
 A histogram has 4 main aspects that help to summarize quantitative data.
*      **Shape:** Overall appearance of the histogram. Ex: bell shaped, right skewed, left skewed etc.
*      **Center:** Mean and median.
*      **Spread:** How far the data is spread out. Ex: range, interquartile range.
*      **Outliers:** Data points that lie far away from the bulk of the data.

First we will look at three examples and use these 4 aspects to summarize the data. Then we will apply that knowledge to the Titanic data.

### Example-1: Adult female height distribution.
![image.png](attachment:image.png)
Source: https://rstudio-pubs-static.s3.amazonaws.com/212500_14180eb375d44d3c83556ddc85591bfe.html

**Shape:** **Bell shaped** and **unimodal**, because it has only one hump.

**Center:** About 165 cm. Because the distribution is bell shaped, the mean and median are roughly the same and sit in the middle.

**Spread:** Range = Max - Min = (185 - 145) cm = 40 cm

**Outliers:** The data are all fairly close together, so there are no apparent outliers.

### One Sentence Summary: The distribution of adult female height is roughly bell shaped with a center of about 165 cm, a range of 40 cm (145 to 185), and no apparent outliers.

### Example-2: Histogram of Salaries in San Francisco
![image.png](attachment:image.png)


**Shape:** Bimodal, right skewed.

**Center:** For the median, we draw a vertical line that splits the area of the histogram into two equal halves; I estimate about 80,000. The mean will be greater than the median because the distribution is right skewed, so I roughly estimate 85,000 from the histogram.

**Spread:** Range = Max - Min = 600,000 - 0 = 600,000 dollars

**Outliers:** Present on the higher end.

### One sentence summary: The distribution of salaries in San Francisco is bimodal and skewed to the right, centered at about 80,000 dollars with most of the data between 40,000 and 120,000 dollars, a range of roughly 600,000 dollars, and outliers present on the higher end.
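
The idea that a right-skewed distribution has a mean above its median can be checked numerically. The sketch below uses made-up salaries (in thousands of dollars) with one very large value; the numbers are assumptions for illustration, not taken from the San Francisco data.

```python
import pandas as pd

# Hypothetical salaries in thousands of dollars (illustration only)
salaries = pd.Series([40, 45, 50, 55, 60, 65, 200])

mean_salary = salaries.mean()      # pulled upward by the single large value
median_salary = salaries.median()  # middle value, barely affected by it
skewness = salaries.skew()         # positive for a right-skewed sample

print(f"mean={mean_salary:.1f}, median={median_salary}, skewness={skewness:.2f}")
```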

### Example-3 : Students Exam Score
![image.png](attachment:image.png)
Source: https://www.coursera.org/learn/understanding-visualization-data
### One sentence summary: The distribution of exam scores is skewed left, centered at about 80 points with most scores between 65 and 90 points, a range of roughly 85 points, and some outliers below 50 points.


### Practice: Ages of the Passengers on the Titanic

In [None]:
# histplot replaces the deprecated distplot(kde=False); missing ages are dropped first
sns.histplot(titanic["Age"].dropna()).set_title("Passengers Age in titanic")
plt.show()

![image.png](attachment:image.png)

**Shape:** Bimodal, right skewed.

**Center:** Estimated by eye from the histogram: median = 25 years, mean = 35 years.

**Spread:** Range = Max - Min = 80 - 2 = 78 years

**Outliers:** Present on the higher end.

### Summary: The distribution of passengers’ ages on the Titanic is bimodal and skewed to the right, centered at about 25 years, with most of the data between 15 and 50 years, a range of 78 years, and outliers present on the higher end.

# Quantitative Data: Numerical Summaries
In the previous section we saw how to put together a summary using a histogram. That summary was rough, so we will now build a more precise one using the five-number summary.

> Min              = minimum value in the data.

> 1st Quartile(Q1) = 25% of the data fall below this value.

> Median(Q2)       = 50% of the data fall below this value.

> 3rd Quartile(Q3) = 75% of the data fall below this value.

> Max              = maximum value in the data.
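
As a minimal sketch, the five numbers can be computed with `Series.quantile`. The `data` values are made up for illustration, and pandas' default linear interpolation between data points is assumed.

```python
import pandas as pd

# Hypothetical data (illustration only)
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quantiles 0, 0.25, 0.5, 0.75 and 1 correspond to Min, Q1, Median, Q3 and Max
five_number = data.quantile([0, 0.25, 0.5, 0.75, 1])
five_number.index = ["Min", "Q1", "Median", "Q3", "Max"]
print(five_number)
```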

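 Here is a numerical summary of total pay and benefits in the San Francisco salaries data set. Besides the count, mean and standard deviation, `describe()` reports the five-number summary (min, 25%, 50%, 75%, max).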


In [None]:
sf_salaries["TotalPayBenefits"].describe()

# Quantitative Data: Boxplots

Boxplots are another way to show a graphical summary of quantitative data. In the previous section we learned about the five-number summary. A boxplot takes this five-number summary and gives us a visual picture of it.
![image.png](attachment:image.png)
The box is drawn from Q1 to Q3, the first and third quartiles. These are the 25th and 75th percentiles of our data, so the distance between them shows how spread out the middle 50 percent of the data is. In other words, the length of the box is the interquartile range. The median is marked by a line inside the box. The final two numbers are the minimum and maximum values of the data, which are joined to the box by lines. These lines are sometimes called whiskers.



### Example-1: Salary in San Francisco

In [None]:
# Pass the data by keyword so the box is drawn horizontally along the x axis
sns.boxplot(x=sf_salaries["TotalPayBenefits"]).set_title("Box plot of total pay and benefits")
plt.show()



This plot is definitely not symmetric; in fact it is skewed to the right. If we recall [Example-2](https://www.kaggle.com/shuchita008/univariate-data-descriptive-statistics-part-i/###Example-2:Histogram-of-Salaries-in-San-Francisco) in the histogram section, we also saw a tail on the right. Our boxplot shows those few unusually large values of the same data. Boxplots have a technique for identifying outliers, and those points are plotted separately for us to see. That is the most interesting part of our data.
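
The text above does not spell out the rule used to flag those points. A common convention, and the default in matplotlib and seaborn box plots, is to mark values that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up salaries (in thousands of dollars); the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical salaries in thousands of dollars (illustration only)
pay = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 400])

q1 = pay.quantile(0.25)
q3 = pay.quantile(0.75)
iqr = q3 - q1  # length of the box

# Fences of the 1.5 * IQR rule; points outside them are drawn individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = pay[(pay < lower_fence) | (pay > upper_fence)]
print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print("outliers:", outliers.tolist())
```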
