This is an assignment on Pandas and Matplotlib. These are important tools that will be used in upcoming assignments.

# Import Libraries

We shall start by importing the Python libraries you will need to run the code.

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib

%matplotlib inline

Go through the [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) and [matplotlib](https://www.tutorialspoint.com/numpy/numpy_matplotlib.htm) tutorials before you start the assignment.

## Load the Data
### About the data

The U.S. Census Bureau began asking about internet use in the American Community Survey (ACS) in 2013, as part of the 2008 Broadband Data Improvement Act, and has published 1-year estimates each year since 2013. The 2016 data show that in many counties, over a quarter of households still do not have internet access.

In [None]:
#Here we load the data
df = pd.read_csv("kaggle_internet.csv")
print('shape of the data:'+str(df.shape))
df.head()

## Working with the data

### 1.1 Check the correlation between the columns using a correlation matrix

In [None]:
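#Let's write the code for the correlation matrix (numeric columns only, since correlation is undefined for text columns)
corr = df.select_dtypes(include='number').corr()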
corr.style.background_gradient(cmap='coolwarm')

### Correlation
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which the variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.
This [blog](https://www.datascience.com/learn-data-science/fundamentals/introduction-to-correlation-python-data-science) has a good mathematical explanation of correlation, and this [post](http://benalexkeen.com/correlation-in-python/) covers its types.
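
As a minimal sketch of the definition above, the toy data below (made up for illustration, not taken from the assignment dataset) contains one column that rises with `x` and one that falls as `x` rises. `DataFrame.corr()` uses the Pearson coefficient by default.

```python
import pandas as pd

# Toy data invented for this example only.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],     # rises exactly in step with x
    "down": [10, 8, 6, 4, 2],   # falls exactly as x rises
})

# Pairwise Pearson correlation between all columns.
toy_corr = toy.corr()
print(toy_corr.round(2))
```

Here `x` and `up` are perfectly positively correlated, while `x` and `down` are perfectly negatively correlated.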



### Drop unnecessary columns
Some columns are highly similar. For example, the correlation between P_white and P_total is 0.9719, so one of the two (for example P_white) can be removed without losing much of the information in the data. We generally do this when the data has unnecessary columns, especially when the number of dimensions is very high (for example, 1500 columns), in order to reduce the computation cost.

In [None]:
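#Using the correlation scores, make a list of columns to drop and drop the columns.
#For each pair with correlation above 0.95, keep the earlier column and drop the later one.
cols = list(corr.columns)
drop_list = [c for i, c in enumerate(cols) if (corr.loc[cols[:i], c] > 0.95).any()]
df.drop(drop_list, axis=1, inplace=True)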
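
The idea of dropping one column from each highly correlated pair can be tried on a small frame first. This is only a sketch under assumptions: the toy columns `a`, `b`, `c` and the 0.95 threshold are illustrative, and the rule used here (keep the earlier column, drop the later one) is one reasonable choice among several.

```python
import pandas as pd

# Hypothetical toy frame: "b" is nearly a copy of "a", "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 3.1, 4.0, 5.1],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],
})
toy_corr = toy.corr()

# A column is dropped if it correlates above 0.95 with any column before it.
cols = list(toy_corr.columns)
to_drop = [c for i, c in enumerate(cols) if (toy_corr.loc[cols[:i], c] > 0.95).any()]

# drop() returns a new frame here; the original toy frame is left unchanged.
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Comparing each column only with the columns before it avoids the diagonal of the matrix, where every column has correlation 1 with itself.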



## 2.0 Plot the data

After loading the data and removing unnecessary columns, we are going to visualize the data with a scatter plot, a line plot, a histogram, and a few more plots using the [matplotlib](https://www.tutorialspoint.com/numpy/numpy_matplotlib.htm) library.
We will define a function for each plot to make our code reusable.

Before plotting, set some parameters that are common to all the plots. This improves the visibility and readability of the plots. It can be done by changing the default rc settings defined in the matplotlibrc configuration files. One example is given.

In [None]:
#Set plot size to 14" x 7" (Solved Example)
matplotlib.rc('figure', figsize = (14, 7))

matplotlib.rcParams['font.size'] = 15
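#Hide the top and right spines for all plots (plt.gca() would only affect the current axes)
matplotlib.rcParams['axes.spines.top'] = False
matplotlib.rcParams['axes.spines.right'] = False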
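
If you want to try rc settings without changing them for the rest of the notebook, `matplotlib.rc_context` applies them only inside a `with` block. The sketch below is illustrative; the particular settings in the dictionary are just examples.

```python
import matplotlib

# Example settings; any valid rcParams keys could be used here.
settings = {
    "figure.figsize": (14, 7),
    "font.size": 15,
    "axes.spines.top": False,
    "axes.spines.right": False,
}

default_size = matplotlib.rcParams["font.size"]

# Inside the block the settings are active; they are restored on exit.
with matplotlib.rc_context(settings):
    size_inside = matplotlib.rcParams["font.size"]

size_after = matplotlib.rcParams["font.size"]
print(size_inside, size_after == default_size)
```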


## 2.1 Scatter Plot

The scatter plot is one of the most commonly used plots for gaining insight into data in Machine Learning problems. For your convenience, the code for the scatter plot is given as an example.

In [None]:
def scatter_plot(x_data, y_data, x_label, y_label, title):
    
    plt.scatter(x_data, y_data, s = 15, color = '#539caf', alpha = 0.75)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    
    
scatter_plot(x_data = df['lat']
            , y_data = df['lon'] 
            , x_label = 'lat'
            , y_label = 'lon'
            , title = 'lon vs lat')

## 2.2 Histogram

Your task here is to complete the histogram function.

In [None]:
#Fill the function given below
def histogram(data, x_label, y_label, title):
    plt.hist(data)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)


#Calling the function "histogram"
histogram(data = df['percent_no_internet']
           , x_label = '%_no_internet'
           , y_label = 'Frequency'
           , title = 'Distribution of percent of household without internet connection')

## 2.3 Line plot

Import the data for line plot.

In [None]:
df2=pd.read_csv('day.csv')
df2.head()

Similarly, complete the lineplot function.

In [None]:
#Fill the function given below
def lineplot(x_data, y_data, x_label, y_label, title):
    #Begin the code
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.plot(x_data, y_data)
    #End the code

#Calling the function to create plot
lineplot(x_data = df2['dteday']
         , y_data = df2['cnt']
         , x_label = 'Date'
         , y_label = 'Total Checkouts'
         , title = 'Total Checkouts for Each Day')
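
One thing that may help with date axes: if the `dteday` column is read as plain text, matplotlib treats every value as a separate category. Converting it with `pd.to_datetime` first lets matplotlib space and label the dates properly. The sketch below assumes ISO-style date strings and uses made-up rows; only the column names follow the notebook.

```python
import pandas as pd

# Made-up rows standing in for day.csv (values are not from the real file).
toy = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-02", "2011-01-03"],
    "cnt": [10, 20, 15],
})

# Convert the text column to real datetime values.
toy["dteday"] = pd.to_datetime(toy["dteday"])
print(toy.dtypes)
```

The converted column can then be passed as `x_data` to the line plot function in the same way as before.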

## 2.4 Subplots

The Matplotlib subplot() function can be called to draw two or more plots in one figure. Matplotlib supports all kinds of subplot layouts, including 2x1 (stacked vertically), 1x2 (side by side), or a 2x2 grid.
Sometimes it is useful to show several plots together. Here, for example, we plot a scatter plot and a histogram in one figure.

In [None]:
plt.subplot(211)
scatter_plot(x_data = df['lat']
            , y_data = df['lon'] 
            , x_label = 'lat'
            , y_label = 'lon'
            , title = 'lon vs lat')
plt.subplot(212)
histogram(data = df['percent_no_internet']
           , x_label = '%_no_internet'
           , y_label = 'Frequency'
           , title = 'Distribution of percent of household without internet connection')
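
The same kind of layout can also be built with `plt.subplots`, which returns the figure and its axes objects up front. This is an alternative sketch with made-up values, not the approach the assignment requires.

```python
import matplotlib.pyplot as plt

# Made-up values purely for illustration.
x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# plt.subplots(2, 1) returns the figure and an array with one Axes per row.
fig, axes = plt.subplots(2, 1)
axes[0].scatter(x, y)
axes[0].set_title("scatter")
axes[1].hist(y)
axes[1].set_title("histogram")

# Adjust spacing so titles and axis labels do not overlap.
fig.tight_layout()
```

Working with explicit axes objects makes it clear which plot each call applies to, which helps once a figure has more than two panels.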

Here, your task is to write the code for subplot having two line plots.

In [None]:
#Complete the function
def line_2subplot(x1_data, x1_label, y1_data, y1_label,title1, x2_data, x2_label, y2_data, y2_label,title2):
    #Begin the code
    plt.subplot(211)
    lineplot(x1_data, y1_data, x1_label, y1_label, title1)
    plt.subplot(212)
    lineplot(x2_data, y2_data, x2_label, y2_label, title2)
    #End the code

#Calling the function to create subplot
line_2subplot(x1_data = df2['dteday']
              ,x1_label = 'Date',y1_data = df2['casual']
              , y1_label = 'Casual Checkouts'
              ,title1 = 'Total Casual Checkouts for Each Day'
              ,x2_data = df2['dteday'] ,x2_label = 'Date',y2_data = df2['registered']
              ,y2_label = 'Registered Checkouts'
              ,title2 = 'Total Registered Checkouts for Each Day')

    


## 2.5 Comparing Two Line Plots

This is useful for comparing two variables over a third variable. Note that these variables may have different scales (they do in the example given below).

In [None]:
#Fill the function given below
def lineplot2y(x_data, x_label, y1_data, y1_label, y2_data, y2_label, title):
    #Begin the code
    plt.subplot(211)
    lineplot(x_data, y1_data, x_label, y1_label, title)
    plt.subplot(212)
    lineplot(x_data, y2_data, x_label, y2_label, title)
    #End the code

#Calling the function to create plot
lineplot2y(x_data = df2['dteday']
           , x_label = 'Date'
           , y1_data = df2['casual']
           , y1_label = 'Casual Checkouts'
           , y2_data = df2['registered']
           , y2_label = 'Registered Checkouts'
           , title = 'No. of Registered vs Casual Checkouts for Each Day')
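
When the two variables have different scales, another option is to draw both lines in the same plot with two y-axes using `Axes.twinx()`. The function below is a hypothetical helper sketched for illustration (its name and the colors are assumptions), shown with made-up numbers rather than the assignment data.

```python
import matplotlib.pyplot as plt

def lineplot_twin_sketch(x_data, x_label, y1_data, y1_label, y2_data, y2_label, title):
    # Hypothetical helper, not one of the assignment's required functions.
    fig, ax1 = plt.subplots()
    ax1.plot(x_data, y1_data, color="tab:blue")
    ax1.set_xlabel(x_label)
    ax1.set_ylabel(y1_label, color="tab:blue")
    ax1.set_title(title)

    # twinx() creates a second y-axis that shares the same x-axis.
    ax2 = ax1.twinx()
    ax2.plot(x_data, y2_data, color="tab:orange")
    ax2.set_ylabel(y2_label, color="tab:orange")
    return ax1, ax2

# Made-up values with deliberately different scales.
ax1, ax2 = lineplot_twin_sketch([1, 2, 3], "day", [5, 7, 6], "small", [500, 650, 700], "large", "demo")
```

Coloring each y-axis label to match its line makes it easier to tell which scale belongs to which variable.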


# It's Finished
Congratulations! You have completed your first assignment. There are many more plots you can explore, and they will help you visualize data in your upcoming assignments. You can also perform numpy operations on the given data.


PS: You need to submit this Python notebook for this part of the assignment.