# Obtaining Data Types in a Matrix Column

### Import the Packages:

In [1]:
import pandas as pd
import numpy as np
import os 

### Load the Dataset

In [4]:
filename = os.path.join("/Users/salmanyagaka/Documents/interviews/adult.csv")
df = pd.read_csv(filename, header=0)
df

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### Inspect the Data 
Use the `head()` method to inspect DataFrame `df`.

In [7]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


### Get Summary Statistics by Column Using the Pandas `describe()` Method

One useful way to quickly overview data and get insight into key statistics for each column is to use the Pandas DataFrame `describe()` method. Run the cell below to get more information about `describe()`. You can also access the online [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).

In [8]:
df.describe?

Signature: df.describe(percentiles=None, include=None, exclude=None) -> 'Self'
Docstring:
Generate descriptive statistics.

Descriptive statistics include those that summarize the central
tendency, dispersion and shape of a
dataset's distribution, excluding ``NaN`` values.

Analyzes both numeric and object series, as well
as ``DataFrame`` column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
    The percentiles to include in the output. All should
    fall between 0 and 1. The default is
    ``[.25, .5, .75]``, which returns the 25th, 50th, and
    75th percentiles.
include : 'all', list-like of dtypes or None (default),

The code cell below runs the `describe()` method on DataFrame `df`. 

In [9]:
df.describe()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0
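The `percentiles` parameter described in the docstring can be tried on a small frame. The sketch below uses a synthetic DataFrame named `toy` with made-up values (an assumption for illustration, not the adult dataset) and requests the 10th and 90th percentiles instead of the default quartiles.

```python
import pandas as pd

# Small synthetic frame (hypothetical values, not the adult dataset).
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "hours": [35, 40, 40, 45, 60],
    "job": ["a", "b", "a", "c", "a"],  # non-numeric, so skipped by default
})

# Request the 10th and 90th percentiles; values must lie between 0 and 1.
summary = toy.describe(percentiles=[0.1, 0.9])
print(summary)
```

The resulting table has rows labeled `10%` and `90%`, and the string column `job` does not appear in it.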


### Get the Data Types for All Columns Using the Pandas `dtypes` Property

Note that some columns are excluded from the summary statistics above. This is because, by default, the `describe()` method includes only numeric columns. You can inspect the data type of each column by using the `dtypes` property. Run the code cell below and inspect the results.

In [10]:
df.dtypes

age                 int64
workclass          object
fnlwgt              int64
education          object
educational-num     int64
marital-status     object
occupation         object
relationship       object
race               object
gender             object
capital-gain        int64
capital-loss        int64
hours-per-week      int64
native-country     object
income             object
dtype: object

Let's take a closer look at the results.
Even if you are familiar with the data types in Python, the results above may seem confusing. For example, what is an `object` type?
Not to worry: Pandas uses its own convention for referring to data types. Here is a simple table to help you map Pandas data types to Python and NumPy data types:

<table>
  <tr>
    <th>Pandas dtype</th>
    <th>Python type</th>
    <th>NumPy type</th>
    <th>Usage</th>
  </tr>
  <tr><td>object</td><td>str or mixed</td><td>string_, unicode_, mixed types</td><td>Text or mixed numeric and non-numeric values</td></tr>
  <tr><td>int64</td><td>int</td><td>int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64</td><td>Integer numbers</td></tr>
  <tr><td>float64</td><td>float</td><td>float_, float16, float32, float64</td><td>Floating point numbers</td></tr>
  <tr><td>bool</td><td>bool</td><td>bool_</td><td>True/False values</td></tr>
  <tr><td>datetime64</td><td>NA</td><td>datetime64[ns]</td><td>Date and time values</td></tr>
  <tr><td>timedelta64[ns]</td><td>NA</td><td>NA</td><td>Differences between two datetimes</td></tr>
  <tr><td>category</td><td>NA</td><td>NA</td><td>Finite list of text values</td></tr>
</table>
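To see how these dtypes show up in practice, here is a minimal sketch that builds one column per kind of value. The frame `demo`, its column names, and its values are all hypothetical. The `object` and `int64` dtypes are requested explicitly, because the dtype Pandas infers for text or integers can differ between versions and platforms.

```python
import pandas as pd

# Synthetic example (hypothetical values), one column per dtype family.
start = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])
end = pd.to_datetime(["2024-01-02", "2024-01-04", "2024-01-06"])

demo = pd.DataFrame({
    "text": pd.Series(["a", "b", "c"], dtype="object"),  # explicit object dtype
    "whole": pd.Series([1, 2, 3], dtype="int64"),        # explicit int64 dtype
    "real": [0.5, 1.5, 2.5],                              # floating point numbers
    "flag": [True, False, True],                          # True/False values
    "when": start,                                        # date and time values
    "elapsed": end - start,                               # differences between datetimes
    "level": pd.Categorical(["low", "high", "low"]),      # finite list of text values
})

# One dtype per column, in the same layout as df.dtypes above.
print(demo.dtypes)
```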



In the cell below, call `df.describe()` with the parameter `include='all'`. This produces summary statistics for all columns in DataFrame `df`. Examine the results. The `describe()` method now gives a quick way to assess balance with regard to `income` (the label), `gender`, `race`, and other columns containing string values.
In particular, observe the values in `count`, `unique`, `top`, and `freq` for the `income` column: the most frequent class, `<=50K`, accounts for 37155 of 48842 rows (about 76%), so the label classes are skewed toward `<=50K` but not starkly imbalanced.

In [11]:
df.describe(include='all')

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
count,48842.0,48842,48842.0,48842,48842.0,48842,48842,48842,48842,48842,48842.0,48842.0,48842.0,48842,48842
unique,,9,,16,,7,15,6,5,2,,,,42,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,,15784,,22379,6172,19716,41762,32650,,,,43832,37155
mean,38.643585,,189664.1,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,105604.0,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117550.5,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178144.5,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237642.0,,12.0,,,,,,0.0,0.0,45.0,,
max,90.0,,1490400.0,,16.0,,,,,,99999.0,4356.0,99.0,,
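The `top` and `freq` rows report only the most common value in each string column. For the full class distribution, one option is `value_counts(normalize=True)`. The sketch below uses a hypothetical four-row Series named `labels` as a stand-in for `df["income"]`.

```python
import pandas as pd

# Hypothetical label column; in the notebook this would be df["income"].
labels = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

# Share of rows in each class; normalize=True returns fractions that sum to 1.
shares = labels.value_counts(normalize=True)
print(shares)
```

On the real data, the same call on `df["income"]` should agree with the `freq / count` ratio read from the table above.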


### A More Detailed Way to Read Column Types using `pd.api.types.infer_dtype()`

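The code cell below creates a dictionary in which each key is a column name and each value is the type inferred from that column's values. It uses the function `pd.api.types.infer_dtype()`, which inspects the values and returns a descriptive string such as `'integer'` or `'string'`, so it can be more specific than the `object` label reported by `dtypes`. Run the cell below and inspect the results.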

In [14]:
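types_dict = {}
print(df.columns)   # column labels
print(df.shape[0])  # number of rows
print(df.shape[1])  # number of columns

# Map each column name to the type inferred from its values.
for column in df.columns:
    types_dict[column] = pd.api.types.infer_dtype(df[column])

types_dict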

Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')
48842
15


{'age': 'integer',
 'workclass': 'string',
 'fnlwgt': 'integer',
 'education': 'string',
 'educational-num': 'integer',
 'marital-status': 'string',
 'occupation': 'string',
 'relationship': 'string',
 'race': 'string',
 'gender': 'string',
 'capital-gain': 'integer',
 'capital-loss': 'integer',
 'hours-per-week': 'integer',
 'native-country': 'string',
 'income': 'string'}
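To illustrate how `infer_dtype()` can tell apart columns that `dtypes` would all report as `object`, here is a small sketch on synthetic Series. The names and values in `samples` are hypothetical and chosen only to trigger different inferred types.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical columns: two stored as object, two as numeric dtypes.
samples = {
    "all_strings": pd.Series(["a", "b"], dtype="object"),
    "ints_and_strings": pd.Series([1, "b"], dtype="object"),
    "all_ints": pd.Series([1, 2]),
    "floats": pd.Series([1.0, 2.5]),
}

# Same idea as the loop above, written as a dict comprehension.
inferred = {name: infer_dtype(values) for name, values in samples.items()}
print(inferred)
```

A column that mixes integers and strings is reported as `'mixed-integer'`, which can help flag columns that need cleaning before analysis.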