# Read all columns as strings
- With big data sources, automatic inference of data types can become a nuisance
- Type inference scans only a certain number of rows to `infer` each column's data type, and that guess can be wrong because different data types may appear in the same column later. For example, a date column may contain invalid dates in the millionth row, or the first 10k rows may contain integers while later rows contain floats
- The pain becomes apparent only when we load the entire dataframe
- This is NOT a new problem. Hadoop tooling (e.g. Sqoop) handles it by loading all data attributes as strings, since a string is the most flexible type and can represent values of the other data types. We later convert each column to the appropriate data type based on where the data is needed and how it is interpreted
- The basic idea is to read ONLY the first row (the header row) and treat its values as the column names. We then map each column to the `str` data type to form a data type dictionary, and use that dictionary to read the entire file with all columns coerced to strings
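
To see the problem described above on a small scale, here is a minimal sketch that reads the same text twice: once with default type inference and once with everything as strings. The in-memory CSV, its column names, and its values are hypothetical and are not taken from the sample file. It also uses `dtype=str`, which applies the string type to every column without building a dictionary first.

```python
import io
import pandas as pd

# Hypothetical CSV text: postal codes with leading zeros, and a column
# that looks like integers until a float shows up in a later row.
csv_text = "postal,amount\n02879,10\n79119,20\n03741,30.5\n"

# Default behaviour: pandas infers a numeric type for both columns.
inferred = pd.read_csv(io.StringIO(csv_text))

# Passing dtype=str keeps every value exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred)
print(as_text)
```

With inference, the postal codes are parsed as numbers and lose their leading zeros, and the whole `amount` column becomes floating point because of the last row. Reading as strings preserves the original text, at the cost of having to convert columns explicitly afterwards.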

In [1]:
import pandas as pd

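# Read only the header row (nrows=0) to get the column names
col_names = pd.read_csv('./data/usa_email_sample_db.csv', nrows=0).columns
# Map every column name to str to build the data type dictionary
types_dict = {col: str for col in col_names}
# Read the full file, coercing every column to string
df = pd.read_csv('./data/usa_email_sample_db.csv', dtype=types_dict)
df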

Unnamed: 0,Business Name,Email,Category,Category 2,Category 3,Address,City,State,Postal,Phone,Website
0,Stone Cove Marina Inc,NOT IN SAMPLE,Docks,Marinas,Dock Builders,134 Salt Pond Rd,Wakefield,RI,2879,(401) 783-8990,http://stonecovemarinari.com
1,Bluehaven Homes,NOT IN SAMPLE,General Contractors,Home Builders,,5701 Time Sq,Amarillo,TX,79119,(806) 452-2545,http://www.bluehavenhomes.com/
2,Michael Jays Tattoo Body Piercing Clinic,NOT IN SAMPLE,Jewelers,Body Piercing,Tattoos,1929 N Washington St,Bismarck,ND,58501,(701) 222-8282,http://michaeljaystattoo.com
3,Cardona-Hine Gallery,NOT IN SAMPLE,"Art Galleries, Dealers Consultants",,,82 County Road 75,Truchas,NM,87578,(505) 689-2253,http://cardonahinegallery.com
4,Cancun,NOT IN SAMPLE,Mexican Restaurants,,,2134 Allston Way,Berkeley,CA,94704,(510) 549-0964,http://www.sabormexicano.com/cancun
...,...,...,...,...,...,...,...,...,...,...,...
581,America Auto Sales,NOT IN SAMPLE,Used Car Dealers,New Car Dealers,,1614 E Irving Blvd,Irving,TX,75060,(972) 721-1331,http://paramountautocenter.com
582,Site Specialties Inc,NOT IN SAMPLE,Playgrounds,Recreation Centers,,1141 Park St,Loganville,GA,30052,(770) 784-0080,http://sitespecialtiesinc.com
583,Old Campbell County Historical Society Inc,NOT IN SAMPLE,Museums,Cultural Centers,,45 NE Broad St,Fairburn,GA,30213,(770) 969-5618,http://museumsusa.org
584,Apex Car Service,NOT IN SAMPLE,Taxis,Transportation Providers,,365 Choate Rd,Canaan,NH,3741,(603) 252-8295,http://apexcarservice.com


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586 entries, 0 to 585
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Business Name  586 non-null    object
 1   Email          586 non-null    object
 2   Category       586 non-null    object
 3   Category 2     311 non-null    object
 4   Category 3     68 non-null     object
 5   Address        583 non-null    object
 6   City           582 non-null    object
 7   State          582 non-null    object
 8   Postal         579 non-null    object
 9   Phone          585 non-null    object
 10  Website        580 non-null    object
dtypes: object(11)
memory usage: 50.5+ KB
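
Once everything is loaded as strings, individual columns can be converted where a real type is needed. The following is a minimal sketch of that later conversion step, not a prescribed workflow. The frame `raw` and its columns `amount` and `signup_date` are hypothetical stand-ins for data loaded as strings; they do not exist in the sample file.

```python
import pandas as pd

# Hypothetical all-string frame, standing in for data read with str dtypes.
raw = pd.DataFrame({
    "amount": ["10", "20.5", "n/a"],
    "signup_date": ["2021-01-15", "2021-02-30", "2021-03-01"],
})

# errors="coerce" turns values that cannot be parsed into NaN / NaT
# instead of raising, so bad rows can be inspected afterwards.
raw["amount_num"] = pd.to_numeric(raw["amount"], errors="coerce")
raw["signup_dt"] = pd.to_datetime(
    raw["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# Rows where a conversion failed are easy to locate for review.
bad_rows = raw[raw["amount_num"].isna() | raw["signup_dt"].isna()]
print(bad_rows)
```

Keeping the original string columns next to the converted ones makes it possible to see exactly which raw values failed to parse.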
