# Loading and viewing your data


## Diagnose the data

### Import pandas

In [None]:
import pandas as pd

### Read the file into a DataFrame: df

You're going to look at a subset of the Department of Buildings Job Application Filings dataset from the NYC Open Data portal. This dataset consists of job applications filed on January 22, 2017.

In [None]:
df = pd.read_csv('dob_job_application_filings_subset.csv')

### print the head of df
Inspect the DataFrame with the .head() and .tail() methods, which return the first and last five rows by default.

In [None]:
print(df.head())

### print the tail of df

In [None]:
print(df.tail())
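As a quick illustration of what these two methods return, here is a minimal sketch on a made-up ten-row DataFrame. The name `toy` and its single column `n` are assumptions for this example and are not part of the DOB dataset.

```python
import pandas as pd

# Toy DataFrame (made up for illustration) with ten rows numbered 0-9.
toy = pd.DataFrame({'n': range(10)})

first_rows = toy.head()    # first five rows by default
last_rows = toy.tail(3)    # an integer argument changes how many rows come back

print(first_rows)
print(last_rows)
```

Passing a number, as in `toy.tail(3)`, is handy when five rows are too many or too few to spot a problem.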

### print the shape of df

The .shape attribute returns the dimensions of the DataFrame as a (rows, columns) tuple, and the .columns attribute returns its column labels. Because this dataset has many columns, the original exercise also pre-loaded a smaller DataFrame, df_subset, containing only the relevant columns; the cells in this notebook work with the full df.

In [None]:
print(df.shape)

### print the columns of df

In [None]:
print(df.columns)
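The sketch below shows what these two attributes look like on a small, made-up DataFrame. The values in `toy` are invented for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy data (made up); the real file has many more rows and columns.
toy = pd.DataFrame({
    'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS'],
    'State': ['NY', 'NY', 'NJ'],
})

# .shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = toy.shape

# .columns is an Index object; .tolist() turns it into a plain list.
column_names = toy.columns.tolist()

print(n_rows, n_cols)
print(column_names)
```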

## More diagnosis
### print the info of df
The .info() method provides important information about a DataFrame, such as the number of rows, number of columns, number of non-missing values in each column, and the data type stored in each column. This is the kind of information that will allow you to confirm whether the 'Initial Cost' and 'Total Est. Fee' columns are numeric or strings. From the results, you'll also be able to see whether or not all columns have complete data in them.

The full DataFrame df and the subset DataFrame df_subset have been pre-loaded. Your task is to use the .info() method on these and analyze the results.

In [None]:
# .info() prints its report directly and returns None, so no print() is needed
df.info()
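If you want the same facts in a form you can test programmatically, one possible approach is sketched below. It assumes a made-up DataFrame `toy` in which 'Initial Cost' is stored as text with a dollar sign; whether the real column looks like this is exactly what .info() would tell you.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'Initial Cost': ['$75000.00', '$0.00', '$30000.00'],
    'Proposed No. of Stories': [5, 0, 2],
    'Site Fill': ['ON-SITE', None, None],
})

# True where the column holds numbers, False where it holds strings.
numeric_flags = {col: is_numeric_dtype(toy[col]) for col in toy.columns}

# Number of missing values in each column.
missing_counts = toy.isna().sum()

print(numeric_flags)
print(missing_counts)
```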


## Calculating summary statistics

### print the summary statistics of df

You'll now use the .describe() method to calculate summary statistics of your data.

In [None]:
print(df.describe())
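To see which columns .describe() picks up by default, here is a small sketch on invented data. The DataFrame `toy` and its values are assumptions for this example.

```python
import pandas as pd

# Toy data (made up): one text column and one numeric column.
toy = pd.DataFrame({
    'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'MANHATTAN'],
    'Proposed No. of Stories': [1, 2, 3, 4, 10],
})

# With no arguments, .describe() summarizes only the numeric columns.
summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label.
median_stories = summary.loc['50%', 'Proposed No. of Stories']
print(median_stories)
```

Here the mean (4.0) sits above the median (3.0) because of the single large value, which is the kind of hint about outliers that summary statistics can give you.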

## Frequency counts for categorical data
As you have seen, .describe() summarizes only numeric columns by default. So how can you diagnose data issues when you have categorical data? One way is by using the .value_counts() method, which returns the frequency counts for each unique value in a column!

### print the value counts for 'Borough'

In [None]:
print(df['Borough'].value_counts(dropna=False))

### print the value counts for 'State'

In [None]:
print(df['State'].value_counts(dropna=False))

### print the value counts for 'Site Fill'

In [None]:
print(df['Site Fill'].value_counts(dropna=False))
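The `dropna=False` argument used above is what makes missing values show up in the counts. A minimal sketch of the difference, on a made-up Series named `boroughs`:

```python
import pandas as pd

# Toy column (made up) with one missing entry.
boroughs = pd.Series(['MANHATTAN', 'BROOKLYN', 'MANHATTAN', None])

# Default behaviour: missing values are left out of the counts.
counts_default = boroughs.value_counts()

# dropna=False adds a row counting the missing values.
counts_with_nan = boroughs.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

Comparing the two totals with the length of the column is a quick way to see how many entries are missing.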

Notice how not all values in the 'State' column are NY. This is an interesting find, as this data is supposed to consist of applications filed in NYC. Curiously, all the 'Borough' values are correct. A good first step toward understanding why is to find and read the codebook for this dataset. Also, for the 'Site Fill' column, you may or may not need to recode the NOT APPLICABLE values to NaN in your final analysis.
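If you do decide to recode NOT APPLICABLE as missing, one way to do it is with .replace(). This is only a sketch: the Series `site_fill` and the 'ON-SITE' value are made up for illustration, and whether recoding is appropriate depends on your analysis.

```python
import numpy as np
import pandas as pd

# Toy column (made up) mixing a real category, the placeholder, and a missing value.
site_fill = pd.Series(['NOT APPLICABLE', 'ON-SITE', None, 'NOT APPLICABLE'])

# Replace the placeholder string with NaN; .replace() returns a new Series.
recoded = site_fill.replace('NOT APPLICABLE', np.nan)

print(recoded.value_counts(dropna=False))
```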

## Visualizing single variables with histograms

The .plot() method allows you to create a plot of each column of a DataFrame. The kind parameter allows you to specify the type of plot to use - kind='hist', for example, plots a histogram.

### Import matplotlib.pyplot

In [None]:
import matplotlib.pyplot as plt

### Plot the histogram
Create a histogram of the 'Existing Zoning Sqft' column. Rotate the axis labels by 70 degrees and use a log scale for both axes.

In [None]:
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

### Display the histogram
While visualizing your data is a great way to understand it, keep in mind that no one technique is better than another. As you saw here, you still needed to look at the summary statistics to help understand your data better. You expected a large number of counts on the left side of the plot because the 25th, 50th, and 75th percentiles all have a value of 0. The plot shows that there are barely any counts near the max value, signifying an outlier.

In [None]:
plt.show()
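The percentile check described above can be reproduced directly with .quantile(). The sketch below uses a made-up Series `sqft` of nine zeros and one very large value to mimic that pattern; the numbers are not taken from the dataset.

```python
import pandas as pd

# Toy column (made up): mostly zeros plus a single extreme value.
sqft = pd.Series([0] * 9 + [1_000_000])

# 25th, 50th, and 75th percentiles.
quartiles = sqft.quantile([0.25, 0.5, 0.75])
largest = sqft.max()

print(quartiles)
print(largest)
```

When all three quartiles are 0 and the maximum is far away, a histogram will pile almost everything into the leftmost bin, which is why a log scale helps.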

## Visualizing multiple variables with boxplots
Histograms are great ways of visualizing single variables. To visualize multiple variables, boxplots are useful, especially when one of the variables is categorical.

Using the .boxplot() method of df, create a boxplot of 'ExistingNo. of Stories' across the different values of 'Borough'.

In [None]:
df.boxplot(column='ExistingNo. of Stories', by='Borough')
plt.show()
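A boxplot draws the median and quartiles of each group, so the same numbers can be inspected as a table with a grouped .describe(). This is a sketch on invented data; the DataFrame `toy` and its values are assumptions.

```python
import pandas as pd

# Toy data (made up): two boroughs with two buildings each.
toy = pd.DataFrame({
    'Borough': ['MANHATTAN', 'MANHATTAN', 'BROOKLYN', 'BROOKLYN'],
    'ExistingNo. of Stories': [2, 10, 1, 3],
})

# One row of summary statistics per borough.
stats = toy.groupby('Borough')['ExistingNo. of Stories'].describe()
print(stats)
```

The '50%' column holds each group's median, which corresponds to the line drawn inside each box.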

## Visualizing multiple variables with scatter plots
Using df, create a scatter plot (kind='scatter') with 'ExistingNo. of Stories' on the x-axis and 'Proposed No. of Stories' on the y-axis. Rotate the x-axis labels by 70 degrees.

In [None]:
df.plot(kind='scatter', x='ExistingNo. of Stories', y='Proposed No. of Stories', rot=70)
plt.show()