## Instructions:

1. Whenever possible, please provide output along with your code. 
2. It is best practice to re-run everything before submitting it. This serves as a final check to ensure there are no errors in your code or output.

#### Import the standard libraries

In [1]:
import numpy as np
import pandas as pd
import datetime as dt

#### Read in the CSV file `fina.csv` and name the DataFrame `comp`

In [2]:
comp = pd.read_csv('fina.csv', index_col=False, parse_dates=['datadate'])
comp.head()

|   | gvkey | datadate | fyr | tic | cik | conm | ni | sale | cogs | oancf | at | lt | seq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2575 | 2012-12-31 | 12 | MTD | 1037646 | Mettler Toledo | 290.847 | 2341.528 | 811.204 | 327.704 | 2117.4 | 1290.181 | 827.219 |
| 1 | 2575 | 2013-12-31 | 12 | MTD | 1037646 | Mettler Toledo | 306.094 | 2378.972 | 794.915 | 345.928 | 2152.819 | 1217.767 | 935.052 |
| 2 | 2575 | 2014-12-31 | 12 | MTD | 1037646 | Mettler Toledo | 338.241 | 2485.983 | 809.537 | 418.912 | 2009.11 | 1289.515 | 719.595 |
| 3 | 2575 | 2015-12-31 | 12 | MTD | 1037646 | Mettler Toledo | 352.82 | 2395.447 | 744.867 | 426.868 | 2018.485 | 1438.028 | 580.457 |
| 4 | 2575 | 2016-12-31 | 12 | MTD | 1037646 | Mettler Toledo | 384.37 | 2508.257 | 767.753 | 443.078 | 2166.777 | 1731.834 | 434.943 |
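
To see what `parse_dates` does without needing `fina.csv` on disk, here is a minimal sketch that reads a small in-memory CSV. The text in `csv_text` and the name `toy` are made up for illustration; only the `pd.read_csv` arguments mirror the cell above.

```python
import io

import pandas as pd

# Hypothetical three-column sample; not the real contents of fina.csv.
csv_text = (
    "gvkey,datadate,ni\n"
    "2575,2012-12-31,290.847\n"
    "2575,2013-12-31,306.094\n"
)

# Same arguments as the read of fina.csv, applied to the in-memory text.
toy = pd.read_csv(io.StringIO(csv_text), index_col=False, parse_dates=["datadate"])

# datadate should come back as a datetime column rather than plain text.
print(toy.dtypes)
```

With `datadate` parsed as a datetime, date accessors such as `toy["datadate"].dt.year` become available.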


#### Do the following: 
 - i) print out the column list 
 - ii) create a list of columns named `cols`

In [3]:
comp.columns

Index(['gvkey', 'datadate', 'fyr', 'tic', 'cik', 'conm', 'ni', 'sale', 'cogs',
       'oancf', 'at', 'lt', 'seq'],
      dtype='object')

In [4]:
cols = comp.columns.to_list()
cols

['gvkey',
 'datadate',
 'fyr',
 'tic',
 'cik',
 'conm',
 'ni',
 'sale',
 'cogs',
 'oancf',
 'at',
 'lt',
 'seq']

#### Identify which variables have missing values, using two different approaches

In [5]:
comp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3502 entries, 0 to 3501
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   gvkey     3502 non-null   int64         
 1   datadate  3502 non-null   datetime64[ns]
 2   fyr       3502 non-null   int64         
 3   tic       3502 non-null   object        
 4   cik       3502 non-null   int64         
 5   conm      3502 non-null   object        
 6   ni        3502 non-null   float64       
 7   sale      3502 non-null   float64       
 8   cogs      2322 non-null   float64       
 9   oancf     3122 non-null   float64       
 10  at        3475 non-null   float64       
 11  lt        3473 non-null   float64       
 12  seq       3498 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(3), object(2)
memory usage: 355.8+ KB


In [6]:
comp.isnull().sum()

gvkey          0
datadate       0
fyr            0
tic            0
cik            0
conm           0
ni             0
sale           0
cogs        1180
oancf        380
at            27
lt            29
seq            4
dtype: int64
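
The counts above can also be narrowed to just the affected columns, or expressed as a share of rows (in `comp`, `cogs` is missing in 1180 of 3502 rows, roughly a third). The sketch below shows one way to do that on a hypothetical toy frame; `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `comp`; the values are made up.
toy = pd.DataFrame({
    "ni":   [1.0, 2.0, 3.0, 4.0],
    "cogs": [0.5, np.nan, np.nan, 2.0],
    "at":   [10.0, 20.0, np.nan, 40.0],
})

# Missing-value count per column, then keep only columns with at least one.
missing_counts = toy.isna().sum()
missing_cols = missing_counts[missing_counts > 0].index.to_list()

# Fraction of rows missing in each column (mean of the boolean mask).
missing_share = toy.isna().mean()

print(missing_cols)
print(missing_share)
```

`isna()` and `isnull()` are aliases in pandas, so either spelling gives the same result.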

#### Using `.apply`, convert the following:
 - `gvkey` to a zero-padded 6-digit text string
 - `cik` to a zero-padded 10-digit text string
 


In [7]:
comp['gvkey'] = comp['gvkey'].apply('{:0>6}'.format)
comp['cik'] = comp['cik'].apply('{:0>10}'.format)
comp.head()

|   | gvkey | datadate | fyr | tic | cik | conm | ni | sale | cogs | oancf | at | lt | seq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 002575 | 2012-12-31 | 12 | MTD | 0001037646 | Mettler Toledo | 290.847 | 2341.528 | 811.204 | 327.704 | 2117.4 | 1290.181 | 827.219 |
| 1 | 002575 | 2013-12-31 | 12 | MTD | 0001037646 | Mettler Toledo | 306.094 | 2378.972 | 794.915 | 345.928 | 2152.819 | 1217.767 | 935.052 |
| 2 | 002575 | 2014-12-31 | 12 | MTD | 0001037646 | Mettler Toledo | 338.241 | 2485.983 | 809.537 | 418.912 | 2009.11 | 1289.515 | 719.595 |
| 3 | 002575 | 2015-12-31 | 12 | MTD | 0001037646 | Mettler Toledo | 352.82 | 2395.447 | 744.867 | 426.868 | 2018.485 | 1438.028 | 580.457 |
| 4 | 002575 | 2016-12-31 | 12 | MTD | 0001037646 | Mettler Toledo | 384.37 | 2508.257 | 767.753 | 443.078 | 2166.777 | 1731.834 | 434.943 |


In [8]:
comp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3502 entries, 0 to 3501
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   gvkey     3502 non-null   object        
 1   datadate  3502 non-null   datetime64[ns]
 2   fyr       3502 non-null   int64         
 3   tic       3502 non-null   object        
 4   cik       3502 non-null   object        
 5   conm      3502 non-null   object        
 6   ni        3502 non-null   float64       
 7   sale      3502 non-null   float64       
 8   cogs      2322 non-null   float64       
 9   oancf     3122 non-null   float64       
 10  at        3475 non-null   float64       
 11  lt        3473 non-null   float64       
 12  seq       3498 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(4)
memory usage: 355.8+ KB
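
The format spec `'{:0>6}'` reads as: fill with `0`, right-align, width 6. As a cross-check, the sketch below compares the `.apply` approach with an alternative, `astype(str).str.zfill(...)`, on a hypothetical toy frame. The alternative is shown only for comparison; the exercise itself asks for `.apply`.

```python
import pandas as pd

# Hypothetical identifiers; 2575 and 1037646 match the first rows of `comp`,
# the other values are made up to show different amounts of padding.
toy = pd.DataFrame({"gvkey": [2575, 12345], "cik": [1037646, 99]})

# Approach used in the notebook: apply a bound str.format to every value.
gvkey_apply = toy["gvkey"].apply("{:0>6}".format)
cik_apply = toy["cik"].apply("{:0>10}".format)

# Alternative for comparison: cast to string, then left-pad with zeros.
gvkey_zfill = toy["gvkey"].astype(str).str.zfill(6)
cik_zfill = toy["cik"].astype(str).str.zfill(10)

print(gvkey_apply.to_list())  # ['002575', '012345']
print(cik_apply.to_list())    # ['0001037646', '0000000099']
print(gvkey_apply.to_list() == gvkey_zfill.to_list())
```

Both routes leave the columns holding text rather than integers, which is why `comp.info()` reports `object` for `gvkey` and `cik` after the conversion.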


#### Drop duplicates in `comp` and check how many observations get dropped

In [9]:
comp = comp.drop_duplicates()
comp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3502 entries, 0 to 3501
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   gvkey     3502 non-null   object        
 1   datadate  3502 non-null   datetime64[ns]
 2   fyr       3502 non-null   int64         
 3   tic       3502 non-null   object        
 4   cik       3502 non-null   object        
 5   conm      3502 non-null   object        
 6   ni        3502 non-null   float64       
 7   sale      3502 non-null   float64       
 8   cogs      2322 non-null   float64       
 9   oancf     3122 non-null   float64       
 10  at        3475 non-null   float64       
 11  lt        3473 non-null   float64       
 12  seq       3498 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(4)
memory usage: 355.8+ KB
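
The `comp.info()` output still reports 3502 entries, so no rows were dropped here. A more direct check is to compare row counts before and after, or to count the rows flagged by `duplicated()`. The sketch below does both on a hypothetical toy frame with one repeated row; `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row (rows 0 and 1 match).
toy = pd.DataFrame({
    "gvkey": ["002575", "002575", "001234"],
    "fyr": [12, 12, 6],
})

n_before = len(toy)

# Rows that repeat an earlier row across all columns.
n_flagged = int(toy.duplicated().sum())

deduped = toy.drop_duplicates()
n_dropped = n_before - len(deduped)

print(f"rows before: {n_before}, after: {len(deduped)}, dropped: {n_dropped}")
```

By default both `duplicated()` and `drop_duplicates()` compare all columns and keep the first occurrence, so the two counts agree.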
