## North Carolina Public Schools Report
The North Carolina Public Schools Report Card [1] and Statistical Profiles [21] databases contain a large volume of information about public, charter, and alternative schools in the State of North Carolina. The publicly accessible data are reported at the school, district, and state levels. They include statistics on student and school performance, academic growth, diversity, safety, instructor experience levels, school funding, educational attainment, and much more.

## Data Overview

We reviewed 17 tables from the North Carolina Report Card database. Two additional tables containing racial composition statistics were subsequently located in the Statistical Profiles database [21].

 - **School Profile**: 
    This table contains profiles at the school, district, and state levels from 2006 to 2016. Most data in the other database tables link to a single school profile in this table through the `unit_code` field. Unit codes with the value “NC-SEA” represent profiles at the state level, and unit codes ending in “LEA” represent profiles at the district level. An individual school can be mapped to its district using the first 3 characters of its unit code. For example, every school belonging to the district “995LEA” has a unit code that begins with “995”.
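The unit code convention described above can be sketched as follows. This is a minimal illustration on made-up unit codes, not rows from the actual table; the helper `profile_level` and the `level` and `district_code` columns are hypothetical names introduced here.

```python
import pandas as pd

# Toy unit codes for illustration only; these are not taken from PROFILE.xlsx.
toy = pd.DataFrame(
    {"unit_code": ["NC-SEA", "995LEA", "995304", "995310", "100LEA", "100302"]}
)

def profile_level(unit_code: str) -> str:
    """Classify a unit code as a state, district, or school profile."""
    if unit_code == "NC-SEA":
        return "state"
    if unit_code.endswith("LEA"):
        return "district"
    return "school"

toy["level"] = toy["unit_code"].map(profile_level)

# For schools, the first three characters identify the parent district.
# Non-school rows are left as missing values.
is_school = toy["level"] == "school"
toy["district_code"] = (toy["unit_code"].str[:3] + "LEA").where(is_school)

print(toy)
```

The same two steps could be applied to the `unit_code` column of the real table once it has been loaded.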

### Importing data

In [8]:
import pandas as pd

# Read in the School Profile data
nc_profile = pd.ExcelFile('data/PROFILE.xlsx')

# type(nc_profile) returns: pandas.io.excel.ExcelFile
# print the sheets in the Excel file
# print(nc_profile.sheet_names) returns: ['PROFILE']

# Import sheets from Excel files as data frames.
nc_profile_df = nc_profile.parse('PROFILE')
nc_profile_df.shape


(32603, 35)

### Cleaning data

Data rarely arrives clean, so it must be prepared before analysis. The first step is to diagnose the data for problems.
 - Common data problems include:
     - Inconsistent column names
     - Missing data
     - Outliers
     - Duplicate rows
     - Untidy structure
     - Columns that need further processing
     - Column types that can signal unexpected data values
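A few of these problems can be checked with one-line pandas calls. The sketch below uses a small made-up frame (the values are invented, only the column names are borrowed from the profile table) to show how duplicate rows, column types, and outliers might be surfaced.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame(
    {
        "unit_code": ["100302", "100302", "995304", "995310"],
        "student_num": [450, 450, 510, 98000],  # last value is implausibly large
        "szip_ad": [27601.0, 27601.0, 28801.0, 28803.0],  # ZIP codes stored as floats
    }
)

# Duplicate rows: count rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# Column types: a float ZIP code column hints that the column needs processing.
dtypes = toy.dtypes

# Outliers: summary statistics make extreme minimum or maximum values visible.
summary = toy["student_num"].describe()

print(n_duplicates)
print(dtypes)
print(summary)
```

These checks only flag candidates; whether a flagged row or column is really a problem still has to be judged against what the field means.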

#### Inconsistent column names
Let's start by inspecting the first common data problem: Inconsistent column names. To do this, we need to get the list of column names from the dataframe `nc_profile_df`.

In [9]:
nc_profile_df.columns

Index(['vphone_ad', 'year', 'unit_code', 'street_ad', 'scity_ad', 'state_ad',
       'szip_ad', 'type_cd', 'closed_ind', 'new_ind', 'super_nm',
       'category_cd', 'url_ad', 'grade_range_cd', 'calendar_type_txt',
       'sna_pgm_type_cd', 'cover_letter_ad', 'school_type_txt',
       'calendar_only_txt', 'title1_type_cd', 'clp_ind', 'focus_clp_ind',
       'summer_program_ind', 'asm_no_spg_ind', 'no_data_spg_ind', 'Lea_Name',
       'School_Name', 'State_Name', 'esea_status', 'student_num',
       'lea_avg_student_num', 'st_avg_student_num', 'Grad_project_status',
       'stem', 'url'],
      dtype='object')

A visual inspection shows that the column names are in fairly good shape: none of them contain spaces, and most consistently use lowercase words separated by underscores. The exceptions are `School_Name`, `State_Name`, `Lea_Name`, and `Grad_project_status`, which start with capital letters. We could leave them as is or convert them to lowercase for consistency.
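If we choose to normalize the names, one possible approach is the string accessor on the column index. The sketch below uses an empty toy frame that carries only a few of the column names listed above.

```python
import pandas as pd

# Empty toy frame carrying a handful of the column names from the profile table.
toy = pd.DataFrame(
    columns=["unit_code", "Lea_Name", "School_Name", "State_Name", "Grad_project_status"]
)

# Convert every column name to lowercase in one step.
toy.columns = toy.columns.str.lower()

print(list(toy.columns))
```

Applied to the real data, the same assignment would be `nc_profile_df.columns = nc_profile_df.columns.str.lower()`.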

#### Missing data
To see which columns have missing data, we can use the `info()` method, which reports the number of non-null values in each column.

In [10]:
nc_profile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32603 entries, 0 to 32602
Data columns (total 35 columns):
vphone_ad              32603 non-null object
year                   32603 non-null int64
unit_code              32603 non-null object
street_ad              32536 non-null object
scity_ad               32536 non-null object
state_ad               32536 non-null object
szip_ad                32536 non-null float64
type_cd                32536 non-null object
closed_ind             32536 non-null float64
new_ind                32536 non-null float64
super_nm               5625 non-null object
category_cd            32510 non-null object
url_ad                 31138 non-null object
grade_range_cd         27589 non-null object
calendar_type_txt      27577 non-null object
sna_pgm_type_cd        27524 non-null object
cover_letter_ad        15956 non-null object
school_type_txt        27577 non-null object
calendar_only_txt      27534 non-null object
title1_type_cd         12776 non-nu

The output shows that many columns have missing data. Missing data does not necessarily indicate a problem, because values can be absent for many reasons, so each column needs to be investigated individually.
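To decide which columns to investigate first, it can help to rank them by how much data is missing. The following is a minimal sketch on invented values; the names `missing` and `missing_share` are introduced here for illustration.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame(
    {
        "unit_code": ["100LEA", "100302", "995304", "995310"],
        "super_nm": ["A. Smith", None, None, None],
        "url_ad": ["http://example.org", "http://example.org/a", None, "http://example.org/b"],
    }
)

# Number of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)

# Share of rows that are missing in each column (0.0 to 1.0).
missing_share = toy.isna().mean()

print(missing)
print(missing_share)
```

On the real table the same calls would be made on `nc_profile_df`. A column with a high share of missing values is not necessarily faulty; for instance, a field might only apply to one kind of profile, which is an assumption that would need to be checked against the data.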