## New Story, Providing New Housing for Earthquake Survivors

This project is on behalf of New Story, a California-based nonprofit that provides new housing to survivors of the Haitian earthquakes. The purpose of the project is to identify the most vulnerable group of survivors, who have the greatest need and should be prioritized when new housing is allocated.

Methodologies: Use clustering techniques to group the surviving households into 3 or 4 segments based on survey data with 77 columns, most of which are text-based. NLP techniques convert these text columns into numeric columns. The first step is to run k-means clustering on each text column to group the observed households into 3 segments, giving each household a segment id (from 1 to 3). After all text columns have been converted in this manner, k-means clustering is run again on the entire sample of surviving households, using all the converted columns as features. The last step is to examine each resulting segment to determine which group is most in need.
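
The two-stage procedure described above can be sketched as follows. This is only an illustration on made-up data: the text does not say which NLP technique is used, so TF-IDF is an assumption here, and the toy column names (`problems`, `health`) and the helper `segment_text_column` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the survey; column names and answers are made up.
toy = pd.DataFrame({
    "problems": ["roof leaks when it rains", "tent leaks and is hot",
                 "too hot in the sun", "no space to sleep",
                 "rain comes inside", "very hot under the tarp"],
    "health": ["flu and headaches", "hunger and fatigue", "flu",
               "headaches", "fatigue and hunger", "flu and fatigue"],
})

def segment_text_column(texts, n_segments=3, seed=0):
    """Vectorize one text column (TF-IDF, an assumption) and return segment ids 1..n."""
    tfidf = TfidfVectorizer().fit_transform(texts.fillna(""))
    km = KMeans(n_clusters=n_segments, n_init=10, random_state=seed)
    return km.fit_predict(tfidf) + 1

# Step 1: one segment id per household for each text column.
segments = pd.DataFrame({col: segment_text_column(toy[col]) for col in toy.columns})

# Step 2: cluster households on the converted columns.
# Segment ids are labels, not magnitudes, so they are one-hot encoded first.
features = pd.get_dummies(segments.astype(str)).astype(float)
household_segment = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
```

One-hot encoding the segment ids before the second pass is a design choice of this sketch, not something stated in the text; feeding the raw ids 1 to 3 would make k-means treat segment 3 as "farther" from segment 1 than segment 2 is.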

Data manipulation: Reduce the columns from 77 to 46 by deleting those that contain no meaningful information (such as the survey form number, interviewer id, and names of the residents and schools). Among the observed households there are some outliers (such as households 10 and 130) that have more than 6 full-time residents in addition to the 2 heads of household. Run the clustering with and without these observations and see whether there are significant differences in the results.
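
One possible way to check whether the outliers matter is to cluster twice and compare the labels of the households present in both runs. The sketch below uses synthetic data and the adjusted Rand index as the comparison measure; both are assumptions for illustration, and `fit_labels` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Synthetic feature matrix standing in for the encoded survey columns.
X = rng.normal(size=(200, 5))
residents = rng.integers(1, 6, size=200)   # hypothetical full-time resident counts (1 to 5)
residents[[10, 130]] = 9                   # two artificially large households
X[[10, 130]] += 6.0                        # make them extreme in feature space too

is_outlier = residents > 6

def fit_labels(data, k=3, seed=0):
    """Run k-means and return one cluster label per row."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)

labels_all = fit_labels(X)
labels_trimmed = fit_labels(X[~is_outlier])

# Compare the two runs on the households that appear in both.
ari = adjusted_rand_score(labels_all[~is_outlier], labels_trimmed)
print(f"adjusted Rand index: {ari:.3f}")
```

A value near 1 would suggest the segmentation is largely unchanged by the outliers; a much lower value would suggest they are reshaping the clusters.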


In [74]:
import pandas as pd
pd.options.display.max_columns = 300
pd.options.display.max_rows = 300
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import sys
sys.setrecursionlimit(100000)
pd.set_option('display.max_colwidth',300)

In [75]:
ti_base_rec = pd.read_csv('/Users/satokosuda/dataforcause/new_story_data/ti_base_rec.csv')
# ti_base_rec > original data set

In [76]:
ti_base_rec.info()
ti_base_rec.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Columns: 285 entries, Form Name to R8 Additional Comments
dtypes: float64(107), int64(5), object(173)
memory usage: 1.1+ MB


Unnamed: 0,Longitude,Tent ID,Lottery #,# of Residents less than 18 yr.,# of Residents more than 18 yr.,# of Tent Residents,Head of Household 1,HH1 Age,HH1 Vendor - Days Working per Week,HH2,HH2 Age,HH2 Vendor Location,HH2 Vendor - Days Working per Week,# of Years Living in Village,# of Years Living in Tent,Rent or Own Elsewhere - Other,Marital Status - # of Years Together,Unnamed: 66,Sleep # of People Per Night,Sleep Time,Wake-up Time,Additional Comments - Electricity,How many gallons of water does your household use per day?,How long does it take you to get your water?,Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,How many times were you stolen from?,Unnamed: 115,Unnamed: 116,Unnamed: 122,Unnamed: 123,Unnamed: 124,Unnamed: 125,Do you understand this interview does not guarantee that you ll receive a house from MOH?,Unnamed: 133,Unnamed: 134,"Floors, if other please specify",R1 Nickname,R1 Age,R1 Class Name,R1 Occupation - Other,R1 Vendor,R1 Vendor Location,R1 Vendor - Days Working per Week,Full-time Tent Resident 2,R2 Nickname,R2 Age,R2 Occupation - Other,R2 Vendor,R2 Vendor Location,R2 Vendor - Days Working per Week,Full-time Tent Resident 3,R3 Nickname,R3 Age,R3 Class Name,R3 Occupation - Other,R3 Vendor - Other,R3 Vendor Location,R3 Vendor - Days Working per Week,Full-time Tent Resident 4,R4 Nickname,R4 Age,R4 Class Name,R4 Occupation - Other,R4 Vendor - Other,R4 Vendor Location,R4 Vendor - Days Working per Week,Full-time Tent Resident 5,R5 Nickname,R5 Age,R5 Grade/Year in School - Education,R5 Class Name,R5 Occupation - Other,R5 Vendor,R5 Vendor - Other,R5 Vendor Location,R5 Vendor - Days Working per Week,Full-time Tent Resident 6,R6 Nickname,R6 Age,R6 Grade/Year in School - Education,R6 Class Name,R6 Occupation - Other,R6 Vendor,R6 Vendor - Other,R6 Vendor Location,R6 Vendor - Days Working per Week,Full-time Tent Resident 7,R7 Nickname,R7 Age,R7 Name of School - Education,R7 Grade/Year in School - Education,R7 Class Name,R7 Occupation - Other,R7 
Vendor,R7 Vendor - Other,R7 Vendor Location,R7 Vendor - Days Working per Week,Full-time Tent Resident 8,R8 Nickname,R8 Age,R8 Name of School - Education,R8 Grade/Year in School - Education,R8 Class Name,R8 Occupation - Other,R8 Vendor,R8 Vendor - Other,R8 Vendor Location,R8 Vendor - Days Working per Week
count,55.0,526.0,70.0,526.0,526.0,526.0,284.0,181.0,166.0,0.0,104.0,0.0,80.0,526.0,519.0,0.0,332.0,0.0,509.0,525.0,509.0,0.0,499.0,0.0,0.0,0.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,248.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,202.0,0.0,0.0,0.0,0.0,0.0,0.0,135.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,84.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,48.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,-72.352654,377.112167,35.5,1.863118,2.091255,3.952471,4.221831,38.364641,4.463855,,40.028846,,4.4,14.671103,5.786127,,10.394578,,3.91945,8.493333,5.616896,,16.625251,,,,,,,2.685714,,,,,,,,,,,,17.076613,,,,,,,,13.10396,,,,,,,10.044444,,,,,,,,9.035714,,,,,,,,9.5625,3.5,,,,,,,,,8.434783,1.0,,,,,,,,,9.4,,,,,,,,,,,9.25,,,,,,,,
std,0.005849,219.283606,20.351085,1.565183,0.955804,1.952684,2.14577,11.82228,1.686458,,12.956579,,1.472103,10.757421,1.810128,,8.919125,,1.920516,0.732593,0.784062,,12.484281,,,,,,,2.083146,,,,,,,,,,,,13.482864,,,,,,,,9.893668,,,,,,,7.878619,,,,,,,,8.278942,,,,,,,,10.837104,3.535534,,,,,,,,,6.867751,,,,,,,,,,9.656974,,,,,,,,,,,7.304597,,,,,,,,
min,-72.360437,2.0,1.0,0.0,1.0,1.0,1.0,18.0,2.0,,18.0,,2.0,1.0,1.0,,0.0,,1.0,6.0,4.0,,2.0,,,,,,,1.0,,,,,,,,,,,,0.0,,,,,,,,0.0,,,,,,,0.0,,,,,,,,0.0,,,,,,,,1.0,1.0,,,,,,,,,1.0,1.0,,,,,,,,,0.0,,,,,,,,,,,2.0,,,,,,,,
25%,-72.356343,181.75,18.25,1.0,2.0,3.0,3.0,29.0,3.0,,29.0,,3.75,7.0,5.0,,4.0,,3.0,8.0,5.0,,5.0,,,,,,,1.0,,,,,,,,,,,,8.0,,,,,,,,6.25,,,,,,,4.0,,,,,,,,4.0,,,,,,,,3.0,2.25,,,,,,,,,2.5,1.0,,,,,,,,,1.0,,,,,,,,,,,5.5,,,,,,,,
50%,-72.35575,373.5,35.5,2.0,2.0,4.0,4.0,38.0,5.0,,38.0,,5.0,11.0,7.0,,7.0,,4.0,9.0,6.0,,15.0,,,,,,,2.0,,,,,,,,,,,,15.0,,,,,,,,12.0,,,,,,,9.0,,,,,,,,7.0,,,,,,,,6.0,3.5,,,,,,,,,6.0,1.0,,,,,,,,,8.0,,,,,,,,,,,6.5,,,,,,,,
75%,-72.346497,577.75,52.75,3.0,2.0,5.0,5.25,46.0,6.0,,50.0,,6.0,21.0,7.0,,15.0,,5.0,9.0,6.0,,25.0,,,,,,,3.0,,,,,,,,,,,,20.0,,,,,,,,17.0,,,,,,,13.5,,,,,,,,11.0,,,,,,,,14.25,4.75,,,,,,,,,13.5,1.0,,,,,,,,,11.0,,,,,,,,,,,10.5,,,,,,,,
max,-72.343669,728.0,70.0,8.0,7.0,13.0,10.0,78.0,7.0,,75.0,,7.0,65.0,10.0,,41.0,,13.0,11.0,9.0,,50.0,,,,,,,10.0,,,,,,,,,,,,88.0,,,,,,,,75.0,,,,,,,50.0,,,,,,,,58.0,,,,,,,,68.0,6.0,,,,,,,,,23.0,1.0,,,,,,,,,34.0,,,,,,,,,,,23.0,,,,,,,,


In [77]:
ti_base_rec.shape

(526, 285)

In [78]:
ti_base_rec.tail()

Unnamed: 0,Form Name,Created By,Created At,Web Link,Latitude,Longitude,Tent ID,Lottery #,# of Residents less than 18 yr.,# of Residents more than 18 yr.,# of Tent Residents,Head of Household 1,HH1 ID Photo,HH1 Last Name,HH1 First Name,HH1 Nickname,HH1 Primary Phone #,HH1 Secondary Phone 3,HH1 Sex,HH1 Age,HH1 Occupation,HH1 Occupation - Other,HH1 Vendor,HH1 Vendor - Other,HH1 Vendor Location,HH1 Vendor - Days Working per Week,Additional Comments,HH2,HH2 Last Name,HH2 First Name,HH2 Nickname,HH2 Primary Phone #,HH2 Secondary Phone 3,HH2 Sex,HH2 Age,HH2 Occupation,HH2 Occupation - Other,HH2 Vendor,HH2 Vendor - Other,HH2 Vendor Location,HH2 Vendor - Days Working per Week,HH2 Additional Comments,Full-time Tent Resident,# of Years Living in Village,Additional Comments - Living in Village,# of Years Living in Tent,# of People Living full time in tent,Education - School Attendance,Education - Name of School,Children Living Elsewhere,# of Children Living Elsewhere,Additional Comments - Children Living Elsewhere,Ownership,Ownership - Other,How much do you pay for rent?,Rent or Own Elsewhere,Rent or Own Elsewhere - Other,Previous Ownership,Marital Status,Marital Status - # of Years Together,Marital Status - Other,Marital Status - Additional Comments,Family Bacgkround Audio,Family Bacgkround,Problems in the Tent - Written,Problems in the Tent - Additional Comments,Unnamed: 66,Sleep # of People Per Night,Sleep - Difficulty Sleeping [Do you ever have issues sleeping?],Sleep - Frequency [How frequently do you have issues sleeping?],Do you normally wake up at night?,"Normally, how frequently do you wake up night?",What are the top two reasons you wake up at night or have issues the sleep?,Do you ever have trouble staying awake during the day?,"In the past week, how often did you have trouble staying awake?",Sleep Time,Wake-up Time,How often do you get sick?,Additional Comments - Health,"In the past year, did someone in this home suffer from cough, congestion or similar 
problems?","During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?","If yes, how frequently did this person suffer from bronchitis, bronchiolitis or pneumonia?","In the past month, did anyone living in the tent suffer from diarrhea?",Do you have access to a latrine?,Additional Comments - Latrine,Do you have electricity in your tent?,Additional Comments - Electricity,What is the main source of drinking water for members of your household?,How many gallons of water does your household use per day?,Do you ever drink water that isn't treated?,How long does it take you to get your water?,"Do you have any other comments, questions or other information you’d like to add?",Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Is there any risk that the tent will collapse?,In the past year did someone enter your house to steal something?,How many times were you stolen from?,If you leave your tent are you concerned that someone ll steal from you?,"How often do you have friends, family or neighborhoods over to your tent?",What is the reason you do not have people over in the tent?,Do you have space to lie down if tired?,Do people living in the tent have space to keep their personal belongings?,"In this tent, if someone wakes up, do they wake up the other people?",Do children have safe places to study?,Does your household own any animals?,"If yes, how many?",Does the household own a radio?,If yes does it function?,AUDIO Would living in a block home create any changes in your life? 
AUDIO,Would living in a block home create any changes in your life?,Unnamed: 113,Unnamed: 114,Unnamed: 115,Unnamed: 116,Do you feel safe in your home?,What is the main thing that makes you feel safe?,What is the main thing that makes you feel unsafe?,Do you feel safe leaving your children alone at home?,Do you feel safe walking in the community at night?,Unnamed: 122,Unnamed: 123,Unnamed: 124,Unnamed: 125,Do you understand this interview does not guarantee that you ll receive a house from MOH?,Do you own the land the tent is on?,Is the deed to the property in your name? -1,Is the deed to the property in your name? -2,Additional Comments - Land Tenure,"If you were to receive a house, would you be willing to move to the area behind Healing Haiti?","If no, what is the main reason you are unwilling to move?",Unnamed: 133,Unnamed: 134,What are the dwellings floors made of?,"Floors, if other please specify",What are the dwellings roof made of?,Photo1,Photo2,Photo3,Photo4,Photo5,Photo6,Photo7,Photo8,Do you feel this person qualifies for a home?,"If no, explain.",Additional Comments - Home Qualification,Full-time Tent Resident 1,R1 Last Name,R1 First Name,R1 Nickname,R1 Age,R1 Sex,R1 Occupation,R1 Name of School - Education,R1 Grade/Year in School - Education,R1 Class Name,R1 Occupation - Other,R1 Vendor,R1 Vendor - Other,R1 Vendor Location,R1 Vendor - Days Working per Week,R1 Relationship,R1 Additional Comments,Full-time Tent Resident 2,R2 Last Name,R2 First Name,R2 Nickname,R2 Age,R2 Sex,R2 Occupation,R2 Name of School - Education,R2 Grade/Year in School - Education,R2 Class Name,R2 Occupation - Other,R2 Vendor,R2 Vendor - Other,R2 Vendor Location,R2 Vendor - Days Working per Week,R2 Relationship,R2 Additional Comments,Full-time Tent Resident 3,R3 Last Name,R3 First Name,R3 Nickname,R3 Age,R3 Sex,R3 Occupation,R3 Name of School - Education,R3 Grade/Year in School - Education,R3 Class Name,R3 Occupation - Other,R3 Vendor,R3 Vendor - Other,R3 Vendor Location,R3 Vendor - 
Days Working per Week,R3 Relationship,R3 Additional Comments,Full-time Tent Resident 4,R4 Last Name,R4 First Name,R4 Nickname,R4 Age,R4 Sex,R4 Occupation,R4 Name of School - Education,R4 Grade/Year in School - Education,R4 Class Name,R4 Occupation - Other,R4 Vendor,R4 Vendor - Other,R4 Vendor Location,R4 Vendor - Days Working per Week,R4 Relationship,R4 Additional Comments,Full-time Tent Resident 5,R5 Last Name,R5 First Name,R5 Nickname,R5 Age,R5 Sex,R5 Occupation,R5 Name of School - Education,R5 Grade/Year in School - Education,R5 Class Name,R5 Occupation - Other,R5 Vendor,R5 Vendor - Other,R5 Vendor Location,R5 Vendor - Days Working per Week,R5 Relationship,R5 Additional Comments,Full-time Tent Resident 6,R6 Last Name,R6 First Name,R6 Nickname,R6 Age,R6 Sex,R6 Occupation,R6 Name of School - Education,R6 Grade/Year in School - Education,R6 Class Name,R6 Occupation - Other,R6 Vendor,R6 Vendor - Other,R6 Vendor Location,R6 Vendor - Days Working per Week,R6 Relationship,R6 Additional Comments,Full-time Tent Resident 7,R7 Last Name,R7 First Name,R7 Nickname,R7 Age,R7 Sex,R7 Occupation,R7 Name of School - Education,R7 Grade/Year in School - Education,R7 Class Name,R7 Occupation - Other,R7 Vendor,R7 Vendor - Other,R7 Vendor Location,R7 Vendor - Days Working per Week,R7 Relationship,R7 Additional Comments,Full-time Tent Resident 8,R8 Last Name,R8 First Name,R8 Nickname,R8 Age,R8 Sex,R8 Occupation,R8 Name of School - Education,R8 Grade/Year in School - Education,R8 Class Name,R8 Occupation - Other,R8 Vendor,R8 Vendor - Other,R8 Vendor Location,R8 Vendor - Days Working per Week,R8 Relationship,R8 Additional Comments
521,TITAYEN - Ayiti | Intake Sondaj,Village Champion 2 Village Champion 2,2017-07-04T10:56:20.152Z,https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd4d8c3e190c00214f47,,,726,,1,2,3,3.0,https://s3.amazonaws.com/qform/images/19b7ec68-8a03-3176-68b6-e6d98857ccdc/58c1b9862c86620c00284b4e,Joseph,Franzt,,4645-1312,,Male,,Contracted Worker,Construction,,,,,,,Raymond,Daphca,,,,Female,,Small business outside or nearby the home,,"Diri, pwa oubyen bannann ou prodwi",,,4.0,,Please see record detail page - https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd4d8c3e190c00214f47,10,Mwen gen 10 ane,1.0,3,No,,No,,,Own,,,No,,Rent a house,Common Law,7.0,,,https://s3.amazonaws.com/qform/audio/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c016ab,Audio,https://s3.amazonaws.com/qform/audio/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c016a9,Audio,,3.0,Yes,Very often,No,,,Yes,5-6 times,9.0,6.0,Sometimes,Grangou ak fatig ak tet femal,Yes,No,,Yes,No,Nou ale nan raje,No,,Ponp oubyen Pi,10.0,Yes,,Selim bwÃ¨,,,,,,Yes,No,,No,Sometimes,,No,No,No,No,No,,No,,https://s3.amazonaws.com/qform/audio/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c01691,Audio,,,,,Yes,Paskem pakonnen nan kont,,No,Yes,,,,,,No,,,Terre leta,Yes,,,,Dirt or soil,,Tin,https://s3.amazonaws.com/qform/images/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c01668,https://s3.amazonaws.com/qform/images/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c01667,https://s3.amazonaws.com/qform/images/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c01666,https://s3.amazonaws.com/qform/images/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c01665,https://s3.amazonaws.com/qform/images/19b7ec68-8a03-3176-68b6-e6d98857ccdc/583f26b8174c850d00c01664,,,,Yes,,Tant lam cho,,Joseph,Wisedael,,1.0,Female,Nothing,,,,,,,,,Pitit fi,Mwen seyon 
Bebe,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
522,TITAYEN - Ayiti | Intake Sondaj,Village Champion 2 Village Champion 2,2017-07-04T11:09:33.867Z,https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd402c06931800ecdaa4,,,727,,3,1,4,4.0,https://s3.amazonaws.com/qform/images/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/58c1b9862c86620c00284b4e,Guerrier,Jacqueline,,3770-1906,,Female,,Nothing,,,,,,,,,,,,,,,,,,,,,,Please see record detail page - https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd402c06931800ecdaa4,12,Mgen 12 ane Isi,7.0,4,Yes,Titanyen,No,,,Own,,,No,,Own House,Widow,,,,https://s3.amazonaws.com/qform/audio/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c016ab,Audio,https://s3.amazonaws.com/qform/audio/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c016a9,Audio,,4.0,Yes,Very often,No,,,Yes,5-6 times,9.0,7.0,Sometimes,Grangou tÃ¨t femal ak grip,Yes,No,,Yes,No,Nan raje,No,,Ponp oubyen Pi,5.0,Yes,,Selim bwÃ¨,,,,,,Yes,No,,No,Sometimes,,No,No,No,No,No,,No,,https://s3.amazonaws.com/qform/audio/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c01691,Audio,,,,,Yes,Jezi avÃ¨m,,No,Yes,,,,,,No,,,Terre leta,Yes,,,,Dirt or soil,,Tin,https://s3.amazonaws.com/qform/images/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c01668,https://s3.amazonaws.com/qform/images/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c01667,https://s3.amazonaws.com/qform/images/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c01666,https://s3.amazonaws.com/qform/images/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c01665,https://s3.amazonaws.com/qform/images/158bf5ea-4c99-39b7-2ac8-5ab3c39fe28b/583f26b8174c850d00c01664,,,,Yes,,Tol la cho,,Mercier,Yvenia,,8.0,Female,Nothing,,,,,,,,,Pitit fi,Mwen lekòl,,Edouard,Dave,,5.0,Male,Nothing,,,,,,,,,Pitit gason,Mwen lekòl,,Edouard,Daveson,,4.0,Male,Nothing,,,,,,,,,Pitit gason,Mwen lekòl,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
523,TITAYEN - Ayiti | Intake Sondaj,Village Champion 2 Village Champion 2,2017-07-04T11:32:55.531Z,https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd462c06931800ecdaa5,,,94,,0,1,1,1.0,https://s3.amazonaws.com/qform/images/523ad96f-f175-ca02-f306-649516b31103/58c1b9862c86620c00284b4e,Merilus,Marie louise,,3906-1469,,Female,,Nothing,,,,,,,,,,,,,,,,,,,,,,Please see record detail page - https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd462c06931800ecdaa5,31,Mgen 31 ane nan zone sa,7.0,1,No response/cannot remember,,Yes,5.0,Yo lakay yo,Own,,,No,,Own House,Widow,,,,,Marim kite vin cheche travay isi,,Li cho plastik la pabon,,1.0,Yes,Very often,No,,,Yes,5-6 times,8.0,8.0,Sometimes,Menm genyon tansyon kite jetem,Yes,No,,Yes,No,Nan raje,No,,Ponp oubyen Pi,5.0,Yes,,Lem pagen kob,,,,,,Yes,No,,No,Sometimes,,No,No,No,No,No,,No,,,Map santim map mouri byen,,,,,Yes,Jezi nan lavim,,No,Yes,,,,,,No,,,Terre leta,Yes,,,,Dirt or soil,,Tin,https://s3.amazonaws.com/qform/images/523ad96f-f175-ca02-f306-649516b31103/583f26b8174c850d00c01668,https://s3.amazonaws.com/qform/images/523ad96f-f175-ca02-f306-649516b31103/583f26b8174c850d00c01667,https://s3.amazonaws.com/qform/images/523ad96f-f175-ca02-f306-649516b31103/583f26b8174c850d00c01666,https://s3.amazonaws.com/qform/images/523ad96f-f175-ca02-f306-649516b31103/583f26b8174c850d00c01665,,,,,Yes,,Li pagen moun pou bali,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
524,TITAYEN - Ayiti | Intake Sondaj,Village Champion 2 Village Champion 2,2017-07-04T11:52:51.939Z,https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd376669260c005ac0b7,,,728,,1,1,2,2.0,https://s3.amazonaws.com/qform/images/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/58c1b9862c86620c00284b4e,Perceval,Judeline,,4767-3155,,Female,,Nothing,,,,,,,,,,,,,,,,,,,,,,Please see record detail page - https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd376669260c005ac0b7,25,Mgen Isi,5.0,2,Yes,Nan moh,No,,,Own,,,No,,Live with Family,Single,,,,https://s3.amazonaws.com/qform/audio/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c016ab,Audio,https://s3.amazonaws.com/qform/audio/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c016a9,Audio,,2.0,Yes,Very often,No,,,Yes,5-6 times,9.0,7.0,Sometimes,Fatig ak grangou,Yes,No,,Yes,No,Nan raje,No,,Ponp oubyen Pi,5.0,Yes,,Selim bwe,,,,,,Yes,No,,No,Sometimes,,No,No,No,No,No,,No,,https://s3.amazonaws.com/qform/audio/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c01691,Audio,,,,,No,,Tant lab pabon,No,Yes,,,,,,No,,,Terre leta,Yes,,,,Dirt or soil,,Tin,https://s3.amazonaws.com/qform/images/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c01668,https://s3.amazonaws.com/qform/images/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c01667,https://s3.amazonaws.com/qform/images/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c01666,https://s3.amazonaws.com/qform/images/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c01665,https://s3.amazonaws.com/qform/images/6c8df7cc-bdad-de3a-bcd6-d51d1f2961da/583f26b8174c850d00c01664,,,,Yes,,Sase pa kote moun rete,,Perceval,James,,7.0,Male,Nothing,,,,,,,,,Pitit gason,M lekòl,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
525,TITAYEN - Ayiti | Intake Sondaj,Village Champion 2 Village Champion 2,2017-07-04T12:17:41.115Z,https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd2e2c06931800ecdaa3,,,724,7.0,1,2,3,,https://s3.amazonaws.com/qform/images/f5491364-4298-ce62-403f-240a7540b6c5/58c1b9862c86620c00284b4e,Charles,Micheline,,3884-8638,,Female,,Small business outside or nearby the home,,Other,Gas,,7.0,,,Jean wilgousse,Charles,,,,Male,,Other,,,,,,,Please see record detail page - https://app.formyoula.com/templates/583f26b8174c850d00c01654/forms/595bcd2e2c06931800ecdaa3,12,I've been here for 12 years.,7.0,3,Yes,Titanyen,No,,,Own,,,No,,Rent a house,Married,23.0,,,https://s3.amazonaws.com/qform/audio/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c016ab,My husband had a problem. We were angry with his family. He left and went out in the country where he spent 3 months. After 3 months we moved to Titanyen.,https://s3.amazonaws.com/qform/audio/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c016a9,It leaks. It leaks. It's too narrow.,,3.0,Yes,Very often,No,,,Yes,5-6 times,9.0,7.0,Sometimes,Headaches and the flu,Yes,No,,Yes,Yes,It's not good at all.,No,,Pump or well,10.0,Yes,,When I don't have any money.,,,,,,Yes,No,,No,Sometimes,,No,No,No,No,No,,No,,https://s3.amazonaws.com/qform/audio/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c01691,"When you live in a tent, how do you feel? First of all, I'll thank God. I thank God because he gave me a house to sleep in. 
I sleep well and wake up well, so I thank God.",,,,,Yes,I have Jesus.,,Yes,Yes,,,,,,No,,,Inherited land,Yes,,,,Dirt or soil,,Tin,https://s3.amazonaws.com/qform/images/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c01668,https://s3.amazonaws.com/qform/images/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c01667,https://s3.amazonaws.com/qform/images/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c01666,https://s3.amazonaws.com/qform/images/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c01665,https://s3.amazonaws.com/qform/images/f5491364-4298-ce62-403f-240a7540b6c5/583f26b8174c850d00c01664,,,,Yes,,They're not living well.,,Charles,Kezer,,12.0,Male,Anyen,,,,,,,,,Pitit gason,Mwen lekòl,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### Drop columns with no meaningful information and create data frame df1

In [79]:
df1 = ti_base_rec.drop(['Form Name', 'Created By', 'Created At','Web Link','Latitude','Longitude','Lottery #','HH1 ID Photo','HH2','Full-time Tent Resident','Family Bacgkround Audio','Problems in the Tent - Written','Additional Comments - Electricity','How long does it take you to get your water?','AUDIO Would living in a block home create any changes in your life? AUDIO'], axis=1)
# Raw strings (r'...') keep backslashes such as \d and \s from being read as string escapes.
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'Photo\d+', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'[A-Z0-9]*\sLast Name', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'[A-Z0-9]*\sFirst Name', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'[A-Z0-9]*\sNickname', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'[A-Z0-9]*\sPrimary\sPhone\s#', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'[A-Z0-9]*\sSecondary\sPhone\s3', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'[A-Z0-9]*\sOccupation - Other', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'[A-Z0-9]*\sVendor[\-A-Za-z]*', axis=1)))]
df1 = df1[df1.columns.drop(list(df1.filter(regex=r'Unnamed:\s[0-9]*', axis=1)))]
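
The repeated filter-and-drop calls could also be written as a single pass with one alternation pattern. The sketch below is an equivalent-in-spirit alternative shown on a toy frame, not the notebook's own code; `toy` and `drop_pattern` are illustrative names.

```python
import pandas as pd

# Empty toy frame that only carries column names similar to the survey's.
toy = pd.DataFrame(columns=[
    "Tent ID", "HH1 Last Name", "HH1 Sex", "R1 First Name", "R2 Nickname",
    "HH2 Primary Phone #", "R3 Vendor - Other", "Photo1", "Unnamed: 66",
])

# One alternation pattern covering the identifier-like columns.
drop_pattern = (
    r"Photo\d+|\sLast Name|\sFirst Name|\sNickname|\sPrimary\sPhone\s#"
    r"|\sSecondary\sPhone\s3|\sOccupation - Other|\sVendor|Unnamed:\s"
)

# str.contains searches anywhere in the name, like DataFrame.filter(regex=...).
to_drop = toy.columns[toy.columns.str.contains(drop_pattern, regex=True)]
kept = toy.drop(columns=to_drop)
print(list(kept.columns))   # ['Tent ID', 'HH1 Sex']
```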

### Count missing values for each column

In [80]:
df1.head()

df1.isnull().sum()

Tent ID                                                                                                  0
# of Residents less than 18 yr.                                                                          0
# of Residents more than 18 yr.                                                                          0
# of Tent Residents                                                                                      0
Head of Household 1                                                                                    242
HH1 Sex                                                                                                  0
HH1 Age                                                                                                345
HH1 Occupation                                                                                           0
Additional Comments                                                                                    349
HH2 Sex                              

### Erase leading and trailing white spaces and new lines

In [81]:
df1.columns = df1.columns.str.strip()

### Create data frame df1_new, which contains columns with fewer than 30 missing values

In [82]:
condition = (df1.isnull().sum() < 30)
df1_new = df1.loc[:, condition]
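
The same filter can be expressed with `DataFrame.dropna` and its `thresh` argument, which counts non-null values rather than missing ones. A small sketch on a toy frame (the column names and the threshold of 2 are made up; the notebook uses 30):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "complete": [1, 2, 3, 4, 5],
    "one_missing": [1, np.nan, 3, 4, 5],
    "mostly_missing": [np.nan, np.nan, np.nan, 4, np.nan],
})
max_missing = 2   # keep columns with fewer than 2 missing values

# Boolean-mask version, as in the cell above.
mask_version = toy.loc[:, toy.isnull().sum() < max_missing]

# dropna keeps columns with at least `thresh` non-null values, so
# "fewer than max_missing missing" becomes "at least n - max_missing + 1 present".
dropna_version = toy.dropna(axis=1, thresh=len(toy) - max_missing + 1)

print(list(mask_version.columns))    # ['complete', 'one_missing']
print(list(dropna_version.columns))  # ['complete', 'one_missing']
```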

In [83]:
df1_new.shape
#Reduced to (526, 53)
df1_new.sample(2)

Unnamed: 0,Tent ID,# of Residents less than 18 yr.,# of Residents more than 18 yr.,# of Tent Residents,HH1 Sex,HH1 Occupation,# of Years Living in Village,Additional Comments - Living in Village,# of Years Living in Tent,# of People Living full time in tent,Education - School Attendance,Children Living Elsewhere,Ownership,Rent or Own Elsewhere,Previous Ownership,Marital Status,Family Bacgkround,Problems in the Tent - Additional Comments,Sleep # of People Per Night,Sleep - Difficulty Sleeping [Do you ever have issues sleeping?],Do you ever have trouble staying awake during the day?,Sleep Time,Wake-up Time,How often do you get sick?,Additional Comments - Health,"In the past year, did someone in this home suffer from cough, congestion or similar problems?","During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?","In the past month, did anyone living in the tent suffer from diarrhea?",Do you have access to a latrine?,Do you have electricity in your tent?,What is the main source of drinking water for members of your household?,How many gallons of water does your household use per day?,Do you ever drink water that isn't treated?,"Do you have any other comments, questions or other information you’d like to add?",Is there any risk that the tent will collapse?,In the past year did someone enter your house to steal something?,If you leave your tent are you concerned that someone ll steal from you?,"How often do you have friends, family or neighborhoods over to your tent?",Do you have space to lie down if tired?,Do people living in the tent have space to keep their personal belongings?,"In this tent, if someone wakes up, do they wake up the other people?",Do children have safe places to study?,Does your household own any animals?,Does the household own a radio?,Would living in a block home create any changes in your life?,Do you feel safe in your home?,Do you feel safe leaving your children alone at home?,Do you feel safe walking 
in the community at night?,Do you own the land the tent is on?,"If you were to receive a house, would you be willing to move to the area behind Healing Haiti?",What are the dwellings floors made of?,What are the dwellings roof made of?,Do you feel this person qualifies for a home?
116,299,1,2,3,Male,Nothing,28,I was born here.,7.0,3,Yes,No,Own,No,Rent a house,Common Law,"The reason I came to Titanyen... My mom, who was in Jacmel, came to Titanyen with us because her husband died and she wasn't able to take care of us, so she came to live in Titanyen with us.","When it's sunny, it's not good at all, because it's always hot. Also, when it rains, evenrthing inside gets wet. The tin covering it aren't good and neither are the walls.",3.0,Yes,Yes,9.0,5.0,Sometimes,The flu,No,No,No,No,No,Pump or well,20.0,Yes,That's what we drink.,Yes,No,No,Sometimes,No,No,No,No,No,No,"I'll feel much better when I get a block house because under the tent I would get wet when it rains. Also, the sun wasn't good for me. When I get a block house, I'll feel more comfortable.",Yes,No,Yes,No,Yes,Dirt or soil,Tin,Yes
162,257,1,3,4,Female,Small business outside or nearby the home,5,I've been here for 5 years.,5.0,4,Yes,No,Own,No,Rent a house,Single,"The man I was with died, and I wasn't able to pay for a house, so I came here and rented a small piece of land so that I can arrnage myself.","When it's windy, it shakes a lot. Also, when it rains we get wet inside. We have a lot of difficulties. Sometimes critters come inside and bite the kids. But it's what we have, we endure it.",4.0,Yes,Yes,8.0,6.0,Sometimes,"The flu, headaches and fatigue",Yes,No,Yes,Yes,No,Pump or well,20.0,Yes,When I don't have money.,Yes,No,No,Sometimes,No,No,No,No,No,No,"They're always different. The cold you feel in a tent is different in a block house. When you sleep in a block house, it's not the same feeling. With the tarp, you're cold at night. Also, when you lay down, you don't feel safe the same way you do when you're in a block house.",Yes,No,Yes,Yes,Yes,Dirt or soil,Tin,Yes


#### Encode categorical data (Sex, Occupations) and make them into dataframes

In [84]:
encoder = LabelEncoder()
hh1_sex = df1_new['HH1 Sex']
hh1_sex_encoded = encoder.fit_transform(hh1_sex)

In [85]:
print(encoder.classes_)
#hh1_sex_encoded[:10]   # 1-D numpy array of shape (526,)
hh1_sex_encoded.reshape(-1,1)[:10]
# 2-D numpy array of shape (526, 1): the intention here is to reshape the array into a single column.
# The number of rows does not need to be given explicitly, so it is passed as -1 and numpy infers
# that there are 526 rows in total.

['Female' 'Male']


array([[1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0]])
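
As a minimal illustration of what `reshape(-1, 1)` does, here is the same operation on a made-up four-element array:

```python
import numpy as np

codes = np.array([1, 0, 0, 1])    # 1-D array, shape (4,)
column = codes.reshape(-1, 1)     # 2-D array with one column; -1 lets numpy infer 4 rows
print(codes.shape, column.shape)  # (4,) (4, 1)
```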

In [86]:
hencoder=OneHotEncoder()
hh1_sex_1hot=hencoder.fit_transform(hh1_sex_encoded.reshape(-1,1))
hh1_sex_1hot

<526x2 sparse matrix of type '<class 'numpy.float64'>'
	with 526 stored elements in Compressed Sparse Row format>

In [87]:
hh1_sex_1hot.toarray()
#hh1_sex_1hot.toarray().dtype  # convert the sparse matrix into a numpy array before turning it into a dataframe

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])

In [88]:
dfsex = pd.DataFrame(hh1_sex_1hot.toarray(), columns = ['HH1 Female', 'HH1 Male'])
dfsex.head()

Unnamed: 0,HH1 Female,HH1 Male
0,0.0,1.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0


In [89]:
df1_new['HH1 Sex'].head()

0      Male
1    Female
2    Female
3    Female
4    Female
Name: HH1 Sex, dtype: object

In [90]:
#HH1_sex = pd.SparseDataFrame(hh1_sex_1hot, columns=encoder.classes_)
#HH1_sex = HH1_sex.fillna(0)
#HH1_sex.head()
# Do not merge a sparse dataframe with df1 because it cannot be merged with the other columns in df1. Instead, convert
# the sparse matrix 'hh1_sex_1hot' into a dense numpy array with .toarray(), turn that into a dataframe, then merge
# it with df1.

In [91]:
encoder = LabelEncoder()
hh1_occupation = df1_new['HH1 Occupation']
hh1_occupation_encoded = encoder.fit_transform(hh1_occupation)
hh1_occupation_encoded
print(encoder.classes_)

['Agriculture/Fish' 'Contracted Worker' 'Driver' 'Family Provides'
 'Laundry / Servant' 'Laundry/Housekeeper' 'Lesiv/Servant' 'Nothing'
 'Other' 'Paid Consistent Job' 'Small business outside or nearby the home'
 'Student' 'Vendor']


In [92]:
hencoder=OneHotEncoder()
hh1_occupation_1hot=hencoder.fit_transform(hh1_occupation_encoded.reshape(-1,1))
dfoccu = pd.DataFrame(hh1_occupation_1hot.toarray(), columns = ['HH1 Agriculture/Fish','HH1 Contracted Worker','HH1 Driver', 'HH1 Family Provides',
 'HH1 Laundry / Servant', 'HH1 Laundry/Housekeeper', 'HH1 Lesiv/Servant', 'HH1 Nothing', 'HH1 Other', 'HH1 Paid Consistent Job', 'HH1 Small business outside or nearby the home','HH1 Student','HH1 Vendor'])

In [93]:
dfoccu.head()

Unnamed: 0,HH1 Agriculture/Fish,HH1 Contracted Worker,HH1 Driver,HH1 Family Provides,HH1 Laundry / Servant,HH1 Laundry/Housekeeper,HH1 Lesiv/Servant,HH1 Nothing,HH1 Other,HH1 Paid Consistent Job,HH1 Small business outside or nearby the home,HH1 Student,HH1 Vendor
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [94]:
df1_new['HH1 Occupation'].head()

0                             Agriculture/Fish
1                                       Vendor
2                                      Nothing
3                          Laundry/Housekeeper
4    Small business outside or nearby the home
Name: HH1 Occupation, dtype: object
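
As an alternative to the LabelEncoder + OneHotEncoder steps above, `pd.get_dummies` can build the same kind of indicator columns and take the column names from the data instead of a hand-typed list. This is a sketch on a hypothetical three-row frame (`demo`), not the notebook's own method.

```python
import pandas as pd

# Hypothetical miniature of df1_new with the two categorical columns.
demo = pd.DataFrame({
    'HH1 Sex': ['Male', 'Female', 'Female'],
    'HH1 Occupation': ['Agriculture/Fish', 'Vendor', 'Nothing'],
})

# One indicator column per category; the names come from the data itself.
dummies = pd.get_dummies(
    demo[['HH1 Sex', 'HH1 Occupation']],
    prefix=['HH1', 'HH1'],
    prefix_sep=' ',
    dtype=float,
)
print(dummies.columns.tolist())
```

Because the labels are derived from the categories, a column cannot end up with the wrong name if the category order changes.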

### Combine new dataframes (sex, occupations) with main dataframe (df1_new)

In [95]:
df1_new = df1_new.drop(['HH1 Sex', 'HH1 Occupation'], axis=1)
df2 = dfsex.join(dfoccu)
df2.shape

(526, 15)

In [96]:
df3 = df1_new.join(df2)
df3.shape
# (526, 66)
df3.sample(2)

Unnamed: 0,Tent ID,# of Residents less than 18 yr.,# of Residents more than 18 yr.,# of Tent Residents,# of Years Living in Village,Additional Comments - Living in Village,# of Years Living in Tent,# of People Living full time in tent,Education - School Attendance,Children Living Elsewhere,Ownership,Rent or Own Elsewhere,Previous Ownership,Marital Status,Family Bacgkround,Problems in the Tent - Additional Comments,Sleep # of People Per Night,Sleep - Difficulty Sleeping [Do you ever have issues sleeping?],Do you ever have trouble staying awake during the day?,Sleep Time,Wake-up Time,How often do you get sick?,Additional Comments - Health,"In the past year, did someone in this home suffer from cough, congestion or similar problems?","During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?","In the past month, did anyone living in the tent suffer from diarrhea?",Do you have access to a latrine?,Do you have electricity in your tent?,What is the main source of drinking water for members of your household?,How many gallons of water does your household use per day?,Do you ever drink water that isn't treated?,"Do you have any other comments, questions or other information you’d like to add?",Is there any risk that the tent will collapse?,In the past year did someone enter your house to steal something?,If you leave your tent are you concerned that someone ll steal from you?,"How often do you have friends, family or neighborhoods over to your tent?",Do you have space to lie down if tired?,Do people living in the tent have space to keep their personal belongings?,"In this tent, if someone wakes up, do they wake up the other people?",Do children have safe places to study?,Does your household own any animals?,Does the household own a radio?,Would living in a block home create any changes in your life?,Do you feel safe in your home?,Do you feel safe leaving your children alone at home?,Do you feel safe walking in the community at 
night?,Do you own the land the tent is on?,"If you were to receive a house, would you be willing to move to the area behind Healing Haiti?",What are the dwellings floors made of?,What are the dwellings roof made of?,Do you feel this person qualifies for a home?,HH1 Female,HH1 Male,HH1 Agriculture/Fish,HH1 Contracted Worker,HH1 Driver,HH1 Family Provides,HH1 Laundry / Servant,HH1 Laundry/Housekeeper,HH1 Lesiv/Servant,HH1 Nothing,HH1 Other,HH1 Paid Consistent Job,HH1 Small business outside or nearby the home,HH1 Student,HH1 Vendor
499,9,0,2,2,25,Mwen terete lartibonite lachepelle,7.0,2,No,No,Own,No,Rent a house,Other,Audio,Audio,2.0,Yes,No,9.0,8.0,Sometimes,Fyev tet femal tou,Yes,No,Yes,Yes,No,"Ponp oubyen Pi,Achte",8.0,Yes,Wi Bon nou ap jwen kob la fasil pou m achte,Yes,No,Yes,Sometimes,No,No,No,No,No,No,Audio,No,Yes,No,No,Yes,Dirt or soil,Tin,Yes,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
185,323,4,2,6,3,I've been here for 3 years,3.0,6,Yes,No,Own,No,Rent a house,Common Law,What led me to Titanyen is... We came here because of January 12th. Our house in Port au Prince was destroyed. We came here and rented a piece of land. That's why we came to TItanyen.,"Here in the tent I have many difficulties. It's a house only for when it's sunny. It's not a house for when it rains. For example, any amount of rain and all of the children have to stand up, and me as well, until the rain stops.",6.0,Yes,Yes,8.0,7.0,Sometimes,Aches and I'm pregnant,Yes,No,No,Yes,No,Pump or well,25.0,Yes,We drink water from the pump.,Yes,No,No,Sometimes,No,No,No,No,No,No,"To me it would be a joy, because when it rains all of the children have to stand up. I'd thank God and also you. If you were to give me a block house I would be overjoyed. I thank you very miuch.",Yes,No,Yes,No,Yes,Dirt or soil,Tin,Yes,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [97]:
ti_base_rec['HH2 Sex'].value_counts()
#hh2_sex.value_counts() #after converting string, 'NA' becomes one of the categories

HH2 Sex
Male      177
Female    154
Name: count, dtype: int64

In [98]:
#hh2_sex = ti_base_rec['HH2 Sex'].astype(str)
#hh2_sex_encoded = encoder.fit_transform(hh2_sex)
#hh2_sex_1hot=hencoder.fit_transform(hh2_sex_encoded.reshape(-1,1))
#hh2_sex_1hot
#'NA' is recognized as a 3rd category besides 'Male' and 'Female', so one-hot encoding converts this column into 3 category
#columns. I only need the 'Male' and 'Female' classes. How to proceed?
#dfsex2 = pd.DataFrame(hh2_sex_1hot.toarray(), columns = ['HH2 Female', 'HH2 Male', 'NA']) #again, because 'NA' is recognized
#as a 3rd class, I end up with 3 one-hot columns where only 2 are wanted.

In [99]:
#A better approach is to take care of the missing values. Columns with more than 30 NaN were dropped to create df1.
#In the process, the 'HH2 Sex' column was dropped along with other columns. It's better to ignore HH2 Sex and HH2 Occupation
#for the purpose of segmenting household data.
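
If the HH2 columns were ever needed, one way to avoid the unwanted third 'NA' class is to leave the missing values as NaN rather than casting them to strings. The sketch below uses a hypothetical four-row series (`hh2_sex_demo`). `pd.get_dummies` leaves `dummy_na` off by default, so a missing answer simply gets 0 in both indicator columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of an 'HH2 Sex'-like column with missing answers.
hh2_sex_demo = pd.Series(['Male', np.nan, 'Female', np.nan], name='HH2 Sex')

# dummy_na defaults to False, so NaN rows get 0 in every indicator column.
hh2_dummies = pd.get_dummies(hh2_sex_demo, prefix='HH2', prefix_sep=' ', dtype=float)
print(hh2_dummies)
```

Whether an all-zero row is an acceptable encoding of "no second head of household" is a modelling decision the notebook does not settle.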

### Next step: Taking care of the rest of the categorical data, filling NA, taking care of text data before running clustering algorithm on the dataframe.

In [100]:
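# Fill numeric gaps with the column median. median() has to be called, and the result is assigned back to the column.
df3['# of Years Living in Tent'] = df3['# of Years Living in Tent'].fillna(df3['# of Years Living in Tent'].median())
df3['How many gallons of water does your household use per day?'] = df3['How many gallons of water does your household use per day?'].fillna(df3['How many gallons of water does your household use per day?'].median())
df3['Sleep # of People Per Night'] = df3['Sleep # of People Per Night'].fillna(df3['Sleep # of People Per Night'].median())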

# How many gallons of water -> fill with median
# How to handle binary (Yes or No) or categorical data with missing values? -> Create a third column that says 'No Response'
# Households with ID 16, 17, and 468 have missing values in many columns. In most cases, theirs are the only missing values
# for that column. Thus dropping these 3 observations entirely from the dataset helps the algorithm run better.
#ti_base_rec.drop([16, 17, 468], axis=0, inplace = True)

### Feature engineer 'Sleep Length' 

In [101]:
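# Put both times on one continuous clock: bedtime is treated as PM (+12) and
# wake-up as the following morning (+24), so their difference is the hours slept.
df3['Sleep Time'] = df3['Sleep Time'] + 12
df3['Wake-up Time'] = df3['Wake-up Time'] + 24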
df3['Wake-up Time'].sample(5)

95     30.0
323    30.0
495    29.0
257    29.0
158    30.0
Name: Wake-up Time, dtype: float64

In [102]:
df3['Sleep Length'] = df3['Wake-up Time'] - df3['Sleep Time']
df3['Sleep Length'] = df3['Sleep Length'].fillna(df3['Sleep Length'].median())
df3['Sleep Length'].sample(5)

#Drop 2 columns from df3
df3= df3.drop(['Sleep Time','Wake-up Time'], axis=1)
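
A small worked example of the 'Sleep Length' arithmetic, on hypothetical values. It assumes, as the offsets of 12 and 24 imply, that bedtimes are evening hours on a 12-hour clock and wake-up times fall on the next morning.

```python
import pandas as pd

# Hypothetical values on a 12-hour clock: bedtime in the evening, wake-up next morning.
sleep_demo = pd.DataFrame({'Sleep Time': [9.0, 8.0], 'Wake-up Time': [5.0, 6.0]})

bed_24h = sleep_demo['Sleep Time'] + 12      # 9 PM -> 21
wake_24h = sleep_demo['Wake-up Time'] + 24   # 5 AM next day -> 29
sleep_demo['Sleep Length'] = wake_24h - bed_24h
print(sleep_demo['Sleep Length'].tolist())   # [8.0, 10.0]
```

Under this assumption a bedtime after midnight (say 1 AM recorded as 1) would come out as 13 and inflate the computed length, so such rows would need separate handling.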

### Replace NaN with 'No Response/Cannot Remember', or 'No'

In [103]:
df3.isnull().sum()

Tent ID                                                                                                 0
# of Residents less than 18 yr.                                                                         0
# of Residents more than 18 yr.                                                                         0
# of Tent Residents                                                                                     0
# of Years Living in Village                                                                            0
Additional Comments - Living in Village                                                                 2
# of Years Living in Tent                                                                               0
# of People Living full time in tent                                                                    0
Education - School Attendance                                                                           0
Children Living Elsewhere                     

In [104]:
df3['Additional Comments - Living in Village'][df3['Additional Comments - Living in Village'].isnull()]
#id 18 NaN -> 'No Response/Cannot Remember'
df3.loc[df3['Additional Comments - Living in Village'].isnull(),'Additional Comments - Living in Village'] = 'No Response/Cannot Remember'
df3['Previous Ownership'][df3['Previous Ownership'].isnull()]
#id 46, 206, 208 NaN -> 'No Response/Cannot Remember'. The column become categorical data
df3.loc[df3['Previous Ownership'].isnull(),'Previous Ownership'] = 'No Response/Cannot Remember'
# Drop 'Wake-up Time'
df3['Do you ever drink water that isn\'t treated?'][df3['Do you ever drink water that isn\'t treated?'].isnull()]
#id 18, 19, 21,22,24 NaN -> 'No Response/Cannot Remember'
df3.loc[df3['Do you ever drink water that isn\'t treated?'].isnull(),'Do you ever drink water that isn\'t treated?'] = 'No Response/Cannot Remember'
df3['Do you have any other comments, questions or other information you’d like to add?'][df3['Do you have any other comments, questions or other information you’d like to add?'].isnull()]
#id 182 NaN -> 'No Response/Cannot Remember'
df3.loc[df3['Do you have any other comments, questions or other information you’d like to add?'].isnull(),'Do you have any other comments, questions or other information you’d like to add?'] = 'No Response/Cannot Remember'
df3['In this tent, if someone wakes up, do they wake up the other people?'][df3['In this tent, if someone wakes up, do they wake up the other people?'].isnull()]
#id 13, 14 NaN -> 'No Response/Cannot Remember'
df3.loc[df3['In this tent, if someone wakes up, do they wake up the other people?'].isnull(),'In this tent, if someone wakes up, do they wake up the other people?'] = 'No Response/Cannot Remember'
df3['If you were to receive a house, would you be willing to move to the area behind Healing Haiti?'][df3['If you were to receive a house, would you be willing to move to the area behind Healing Haiti?'].isnull()]
#id 0 - 15, 18, 19, 21, 22 NaN -> 'No Response/Cannot Remember'
df3.loc[df3['If you were to receive a house, would you be willing to move to the area behind Healing Haiti?'].isnull(),'If you were to receive a house, would you be willing to move to the area behind Healing Haiti?'] = 'No Response/Cannot Remember'
df3.loc[df3['Children Living Elsewhere'].isnull(),'Children Living Elsewhere']='No'
df3.loc[df3['How often do you get sick?'].isnull(),'How often do you get sick?'] = 'No Response/Cannot Remember'
df3.loc[df3['Additional Comments - Health'].isnull(),'Additional Comments - Health'] = 'No Response/Cannot Remember'
df3.loc[df3['In the past year, did someone in this home suffer from cough, congestion or similar problems?'].isnull(),'In the past year, did someone in this home suffer from cough, congestion or similar problems?'] = 'No'
df3.loc[df3['During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?'].isnull(),'During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?'] = 'No'
df3.loc[df3['In the past month, did anyone living in the tent suffer from diarrhea?'].isnull(),'In the past month, did anyone living in the tent suffer from diarrhea?'] = 'No Response/Cannot Remember'
df3.loc[df3['Do you have access to a latrine?'].isnull(),'Do you have access to a latrine?'] = 'No'
df3.loc[df3['Do you have electricity in your tent?'].isnull(),'Do you have electricity in your tent?'] = 'No'
df3.loc[df3['Is there any risk that the tent will collapse?'].isnull(),'Is there any risk that the tent will collapse?'] = 'No'
df3.loc[df3['In the past year did someone enter your house to steal something?'].isnull(),'In the past year did someone enter your house to steal something?'] = 'No'
df3.loc[df3['How often do you have friends, family or neighborhoods over to your tent?'].isnull(),'How often do you have friends, family or neighborhoods over to your tent?'] = 'No Response/Cannot Remember'
df3.loc[df3['Do you have space to lie down if tired?'].isnull(),'Do you have space to lie down if tired?'] = 'No'
df3.loc[df3['Do people living in the tent have space to keep their personal belongings?'].isnull(),'Do people living in the tent have space to keep their personal belongings?'] = 'No'
df3.loc[df3['Do children have safe places to study?'].isnull(),'Do children have safe places to study?'] = 'No'
df3.loc[df3['Do you feel safe leaving your children alone at home?'].isnull(),'Do you feel safe leaving your children alone at home?'] = 'No'
df3.loc[df3['Do you own the land the tent is on?'].isnull(),'Do you own the land the tent is on?'] = 'No'
df3.loc[df3['If you leave your tent are you concerned that someone ll steal from you?'].isnull(),'If you leave your tent are you concerned that someone ll steal from you?'] = 'No'
df3.loc[df3['What is the main source of drinking water for members of your household?'].isnull(),'What is the main source of drinking water for members of your household?'] = 'No'
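
The long run of `.loc` assignments above repeats one pattern: pick a column, find its NaN rows, write a default. A more compact sketch of the same idea passes a column-to-default dictionary to `DataFrame.fillna`. The frame `demo` and the two-entry dictionary are hypothetical; the real mapping would list every column handled above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column miniature of df3.
demo = pd.DataFrame({
    'Do you have access to a latrine?': ['Yes', np.nan, 'No'],
    'How often do you get sick?': [np.nan, 'Sometimes', 'Rarely'],
})

# Map each column to the value that should stand in for a missing answer.
fill_values = {
    'Do you have access to a latrine?': 'No',
    'How often do you get sick?': 'No Response/Cannot Remember',
}

# DataFrame.fillna accepts a dict of column -> fill value.
demo = demo.fillna(value=fill_values)
print(demo)
```

Keeping the defaults in one dictionary also makes it easier to review which questions treat a missing answer as 'No' and which as 'No Response/Cannot Remember'.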

### Extract information from features with a lot of missing values (more than 30 NaN)

In [105]:
# Do some investigation to learn what to do with the columns with more than 30 missing values
condition = (df1.isnull().sum() >= 30)
df1_alt = df1.loc[:, condition]

In [106]:
df1_alt.isnull().sum().sort_values(ascending = True)
# Basic approach is as follows:
## Will 'rescue' columns with less than 50% missing values and with useful information
## Will reduce the number of rescued columns by combining multiple columns or deleting redundancy.

# Here's how to achieve the above:
## Full Time Tent Resident (R1 - R8) columns can be ignored because the only useful info is
## 1) Total number of residents in one household (already known)
## 2) Number of residents in the household who work
## Therefore, after calculating metric 2), the 72 related columns can be deleted from the database.
## Columns related to Sleep Difficulty can be combined into one, a binary column indicating if they have trouble or not
## Columns in which more than 50% of data are missing simply cannot be useful for the analysis

Additional Comments - Home Qualification                                                      108
Sleep - Frequency [How frequently do you have issues sleeping?]                               111
Do you normally wake up at night?                                                             112
Additional Comments - Land Tenure                                                             166
Additional Comments - Latrine                                                                 188
Marital Status - # of Years Together                                                          194
HH2 Sex                                                                                       195
HH2 Occupation                                                                                202
In the past week, how often did you have trouble staying awake?                               213
Head of Household 1                                                                           242
What is the main thi
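
The 50% rule stated in the comments can be expressed directly with the share of missing values per column. A minimal sketch on a hypothetical frame (`demo`); the threshold of 0.5 is the one named in the comments.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'a' is 25% missing, column 'b' is 75% missing.
demo = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [np.nan, np.nan, np.nan, 'x'],
})

missing_share = demo.isnull().mean()          # fraction of NaN per column
rescued = demo.loc[:, missing_share < 0.5]    # keep columns under 50% missing
print(rescued.columns.tolist())               # ['a']
```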

In [107]:
df1_alt.loc[:,['HH2 Occupation','R1 Occupation','R2 Occupation','R3 Occupation','R4 Occupation','R5 Occupation','R6 Occupation','R7 Occupation','R8 Occupation']][:20]
# If NaN, Nothing, or Student then 'No'. Otherwise 'Yes'. For each household, count the number of residents with 'Yes'
# in Occupation columns, create the new column called 'Number of residents who work'
resident_work = df1_alt.loc[:,['HH2 Occupation','R1 Occupation','R2 Occupation','R3 Occupation','R4 Occupation','R5 Occupation','R6 Occupation','R7 Occupation','R8 Occupation']]
resident_work[:20]

Unnamed: 0,HH2 Occupation,R1 Occupation,R2 Occupation,R3 Occupation,R4 Occupation,R5 Occupation,R6 Occupation,R7 Occupation,R8 Occupation
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,
3,,,,,,,,,
4,,,,,,,,,
5,,Student,Nothing,,,,,,
6,,,,,,,,,
7,,Student,Student,,,,,,
8,,Student,Student,,,,,,
9,Nothing,Paid Consistent Job,Student,Student,Student,Student,,,


In [108]:
#resident_work.where(resident_work.notnull(), 'No', inplace = True)
resident_work.replace(['No','Nothing','Student','Elèv','Anyen'], 0, inplace = True)
resident_work.replace(['Agriculture/Fish','Contracted Worker','Driver','Family Provides','Laundry / Servant','Laundry/Housekeeper','Lesiv/Servant',
'Other','Lòt','Paid Consistent Job','Small business outside or nearby the home','Vendor','Mwen travay kòm yon kontraktè','Chofè','Travay late/lapèch','Ti komès'],1, inplace=True)

In [109]:
resident_work[:20]

Unnamed: 0,HH2 Occupation,R1 Occupation,R2 Occupation,R3 Occupation,R4 Occupation,R5 Occupation,R6 Occupation,R7 Occupation,R8 Occupation
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,
3,,,,,,,,,
4,,,,,,,,,
5,,0.0,0.0,,,,,,
6,,,,,,,,,
7,,0.0,0.0,,,,,,
8,,0.0,0.0,,,,,,
9,0.0,1.0,0.0,0.0,0.0,0.0,,,


In [110]:
working = resident_work.sum(axis=1)  # pandas series of row-wise totals across the occupation columns
workingresidents = pd.DataFrame({'Number of Residents with Income': working.values})  # number of residents with income besides the head of household

In [111]:
workingresidents[:10]

Unnamed: 0,Number of Residents with Income
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
5,0.0
6,0.0
7,0.0
8,0.0
9,1.0
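
The count of working residents can also be sketched without listing every paying occupation. The version below, on hypothetical data, treats any recorded occupation that is not in a "not working" list as work. That is a design choice that differs from the two `replace` calls above, which only recode the occupations they name.

```python
import numpy as np
import pandas as pd

# Hypothetical occupation columns for three households.
occ = pd.DataFrame({
    'R1 Occupation': ['Student', 'Vendor', np.nan],
    'R2 Occupation': ['Nothing', 'Driver', np.nan],
})

# Assumed list of answers that do not count as income-earning work.
not_working = ['No', 'Nothing', 'Student', 'Elèv', 'Anyen']

# A resident works when an occupation is recorded and it is not in the list above.
works = occ.notna() & ~occ.isin(not_working)
n_working = works.sum(axis=1)
print(n_working.tolist())   # [0, 2, 0]
```

With this approach an unexpected spelling of an occupation is counted as work rather than left as a string, so the "not working" list is the part that needs checking against the data.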


In [112]:
df1_alt.loc[:,['Sleep - Frequency [How frequently do you have issues sleeping?]','Do you normally wake up at night?','What are the top two reasons you wake up at night or have issues the sleep?','Normally, how frequently do you wake up night?']][:10]
# If 'Sleep Frequency' is blank, the other sleep-related columns are also blank for that household. 'Sleep Frequency' and 'Do you normally wake up at night?'
# represent the sleep issue well enough. No need to use the other 3 columns. Convert NaN to 'No Response/Cannot Remember'.

Unnamed: 0,Sleep - Frequency [How frequently do you have issues sleeping?],Do you normally wake up at night?,What are the top two reasons you wake up at night or have issues the sleep?,"Normally, how frequently do you wake up night?"
0,Very often,Yes,It’s hot or cold,Very often
1,Very often,Yes,"It’s hot or cold, Anxiety and worry about my life",
2,Very often,Yes,"It’s hot or cold, Annoyance by nd coming into the tent, Noise Outside (animals, people), Fear of Structural Housing Problems",Very often
3,Very often,Yes,"It’s hot or cold, Annoyance by nd coming into the tent, watching over the younger children",Very often
4,Very often,Yes,"It’s hot or cold, worring about the problems th this tent, worry and fear",Very often
5,Very often,Yes,"It’s hot or cold, Annoyance by nd coming into the tent",Sometimes
6,Very often,Yes,"It’s hot or cold,Annoyance by nd coming into the tent, Noise Outside (animals, people)",Very often
7,Sometimes,Yes,"The heat or the cold, the wind causing problems.",Very often
8,Very often,Yes,"The heat or the cold, the wind causing problems.",Very often
9,Sometimes,Yes,"The heat or the cold, the wind causing problems.",Sometimes


In [113]:
# 2 fields related to sleep difficulty, 'Do you EVER have issues sleeping?' & 'Do you EVER have trouble staying awake?',
# have almost no missing values. Process these 2 columns separately from the rest of the columns.
df3.loc[:,['Sleep - Difficulty Sleeping [Do you ever have issues sleeping?]','Do you ever have trouble staying awake during the day?',]][:10]

Unnamed: 0,Sleep - Difficulty Sleeping [Do you ever have issues sleeping?],Do you ever have trouble staying awake during the day?
0,Yes,Yes
1,Yes,Yes
2,Yes,Yes
3,Yes,Yes
4,Yes,Yes
5,Yes,Yes
6,Yes,Yes
7,Yes,Yes
8,Yes,Yes
9,Yes,Yes


In [114]:
df3.loc[:,['Sleep - Difficulty Sleeping [Do you ever have issues sleeping?]','Do you ever have trouble staying awake during the day?',]].describe()
# 'Do you ever have issues sleeping?' Yes 416, No 109; 'Do you ever have trouble staying awake?' Yes 313, No 212
# Score Yes as 1 and No as 0, add up the scores, and combine with the sleep_condition column derived below

Unnamed: 0,Sleep - Difficulty Sleeping [Do you ever have issues sleeping?],Do you ever have trouble staying awake during the day?
count,525,525
unique,2,2
top,Yes,Yes
freq,416,313


In [115]:
df1.loc[:,['Sleep - Difficulty Sleeping [Do you ever have issues sleeping?]','Do you normally wake up at night?','Normally, how frequently do you wake up night?']][:20]

Unnamed: 0,Sleep - Difficulty Sleeping [Do you ever have issues sleeping?],Do you normally wake up at night?,"Normally, how frequently do you wake up night?"
0,Yes,Yes,Very often
1,Yes,Yes,
2,Yes,Yes,Very often
3,Yes,Yes,Very often
4,Yes,Yes,Very often
5,Yes,Yes,Sometimes
6,Yes,Yes,Very often
7,Yes,Yes,Very often
8,Yes,Yes,Very often
9,Yes,Yes,Sometimes


In [116]:
result = pd.concat([df1.loc[:,['Do you ever have trouble staying awake during the day?']], df1_alt.loc[:,['In the past week, how often did you have trouble staying awake?']]], axis=1)
#df1_alt.loc[:,'In the past week, how often did you have trouble staying awake? ']
result.head(10)

Unnamed: 0,Do you ever have trouble staying awake during the day?,"In the past week, how often did you have trouble staying awake?"
0,Yes,3-4 times
1,Yes,1-2 times
2,Yes,1-2 times
3,Yes,3-4 times
4,Yes,1-2 times
5,Yes,1-2 times
6,Yes,3-4 times
7,Yes,1-2 times
8,Yes,3-4 times
9,Yes,1-2 times


In [117]:
# Replace NA with 'No' in the sleep-related columns
df1_alt.loc[df1_alt['Sleep - Frequency [How frequently do you have issues sleeping?]'].isnull(),'Sleep - Frequency [How frequently do you have issues sleeping?]'] = 'No'
df1_alt.loc[df1_alt['Do you normally wake up at night?'].isnull(),'Do you normally wake up at night?'] = 'No'
df1_alt.loc[df1_alt['In the past week, how often did you have trouble staying awake?'].isnull(),'In the past week, how often did you have trouble staying awake?'] = 'No'
df1_alt.loc[df1_alt['Normally, how frequently do you wake up night?'].isnull(),'Normally, how frequently do you wake up night?'] = 'No'

In [118]:
filter1 = df1_alt['Do you normally wake up at night?']=='No'
df1_alt['Do you normally wake up at night?'].where(filter1, df1_alt['Normally, how frequently do you wake up night?'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1_alt['Do you normally wake up at night?'].where(filter1, df1_alt['Normally, how frequently do you wake up night?'], inplace=True)
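
The warning above is raised because the column is modified in place on a frame that was itself sliced from `df1`. A sketch of the same conditional replacement written as a plain assignment, on a hypothetical three-row frame (`demo`):

```python
import pandas as pd

# Hypothetical miniature of the two sleep columns after NaN -> 'No'.
demo = pd.DataFrame({
    'Do you normally wake up at night?': ['Yes', 'Yes', 'No'],
    'Normally, how frequently do you wake up night?': ['Very often', 'No', 'No'],
})

col = 'Do you normally wake up at night?'
freq = 'Normally, how frequently do you wake up night?'

# Keep 'No' where it already is; otherwise take the frequency answer.
# Assigning the result back avoids in-place edits on a possible copy.
demo[col] = demo[col].where(demo[col] == 'No', demo[freq])
print(demo[col].tolist())   # ['Very often', 'No', 'No']
```

Creating the sliced frame with an explicit `.copy()` is the usual way to make it independent of its parent, which also silences this class of warning.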


In [119]:
df1_alt['Do you normally wake up at night?'].unique()
# 'Very often' - 5, 'Sometimes' - 4, 'No Response' - 3, 'Rarely' - 2, 'No' - 1
# This is the only column to be used for the 'Trouble of Sleeping' criterion

array(['Very often', 'No', 'Sometimes', 'Rarely'], dtype=object)

In [120]:
df1_alt['Sleep - Frequency [How frequently do you have issues sleeping?]'].unique()
# 'Very often' - 4, 'Sometimes' - 3, 'No Response' - 2, 'Rarely' - 1
df1_alt['In the past week, how often did you have trouble staying awake?'].unique()

array(['3-4 times', '1-2 times', 'No', '5-6 times', '7 or more times'],
      dtype=object)

In [121]:
filter2 = result['Do you ever have trouble staying awake during the day?']=='No'
result['Do you ever have trouble staying awake during the day?'].where(filter2, df1_alt['In the past week, how often did you have trouble staying awake?'], inplace=True)

In [122]:
# Drop other sleep related columns from the database
df1_alt.drop(['Sleep - Frequency [How frequently do you have issues sleeping?]','What are the top two reasons you wake up at night or have issues the sleep?', 'In the past week, how often did you have trouble staying awake?','Normally, how frequently do you wake up night?'], axis=1, inplace=True)
sleep_condition = pd.concat([df1_alt['Do you normally wake up at night?'], result['Do you ever have trouble staying awake during the day?']], axis=1)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1_alt.drop(['Sleep - Frequency [How frequently do you have issues sleeping?]','What are the top two reasons you wake up at night or have issues the sleep?', 'In the past week, how often did you have trouble staying awake?','Normally, how frequently do you wake up night?'], axis=1, inplace=True)


In [123]:
sleep_condition[:20]

Unnamed: 0,Do you normally wake up at night?,Do you ever have trouble staying awake during the day?
0,Very often,3-4 times
1,No,1-2 times
2,Very often,1-2 times
3,Very often,3-4 times
4,Very often,1-2 times
5,Sometimes,1-2 times
6,Very often,3-4 times
7,Very often,1-2 times
8,Very often,3-4 times
9,Sometimes,1-2 times


In [124]:
new = workingresidents.join(sleep_condition)
df3.drop(['Do you ever have trouble staying awake during the day?','Sleep - Difficulty Sleeping [Do you ever have issues sleeping?]'], axis=1, inplace=True)

In [125]:
df4 = df3.join(new)

In [126]:
#df4 = df4.apply(lambda x: x.str.strip() if x.dtype =='object' else x) # This does not seem to work. Why?
df4.tail()
df4.isnull().sum()

# 1)'# of residents older than 18 yr.' is not necessary 2) 'Additional Comments Living in Village' > Do we need this? 
# 3) '# of people living full time in tent' > redundant? 4) 'Sleep # of people per night' > redundant 5) 'Sleep Difficulty'
# & 'Do you ever have trouble staying awake' > Drop them

# 6) Run text clustering for 'Family Background' 7) 'Additional Comments Health' half of the comments are in Creole
# 8) 'Do you have any other comments, questions or other information you’d like to add?' > text clustering
# 9) 'How often do you have friends, family or neighborhoods over to your tent?' Often/Frequently 3 Sometimes 2
# No Response 1 Never 0

Tent ID                                                                                                0
# of Residents less than 18 yr.                                                                        0
# of Residents more than 18 yr.                                                                        0
# of Tent Residents                                                                                    0
# of Years Living in Village                                                                           0
Additional Comments - Living in Village                                                                0
# of Years Living in Tent                                                                              0
# of People Living full time in tent                                                                   0
Education - School Attendance                                                                          0
Children Living Elsewhere                              
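
On the commented-out `str.strip()` attempt: the notebook does not show how it failed, so the cause is unknown. One possibility (an assumption) is object columns that mix strings with other values, where the `.str` accessor returns NaN for the non-string entries. A sketch that strips only the elements that really are strings, on a hypothetical frame (`demo`) with a helper named `strip_strings`:

```python
import pandas as pd

# Hypothetical frame with a mixed-type object column and a numeric column.
demo = pd.DataFrame({'answer': [' Yes', 'No ', 3], 'count': [1, 2, 3]})

def strip_strings(column):
    # Only touch object columns, and only the elements that are strings.
    if column.dtype == object:
        return column.map(lambda v: v.strip() if isinstance(v, str) else v)
    return column

cleaned = demo.apply(strip_strings)
print(cleaned['answer'].tolist())   # ['Yes', 'No', 3]
```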

In [127]:
df4.shape

(526, 66)

In [128]:
df4.drop(['# of Residents more than 18 yr.','# of People Living full time in tent','Sleep # of People Per Night','Additional Comments - Living in Village'], axis=1, inplace=True)

In [129]:
# Label Encoding
df4['What are the dwellings floors made of?'].replace(['Mozayik-Seramik, Planch','Ceramic'],'Tile ceramic or wood planks', inplace=True)

In [130]:
df4['How often do you have friends, family or neighborhoods over to your tent?'].replace(['Often', 'Frequently', 'Sometimes','No Response/Cannot Remember','Never'],[3,3,2,1,0], inplace=True)

In [131]:
df4['How often do you get sick?'].replace(['Very often','Sometimes','No Reseponse/Can\'t Remember','No Response/Cannot Remember','Rarely'],[4,3,2,2,1], inplace=True)

In [132]:
df4['Do you normally wake up at night?'].value_counts()
df4['Do you normally wake up at night?'].replace(['Very often','Sometimes','Rarely','No'],[3,2,1,0], inplace=True)

In [133]:
df4['Do you ever have trouble staying awake during the day?'].value_counts()
df4['Do you ever have trouble staying awake during the day?'].replace(['7 or more times','5-6 times','3-4 times','1-2 times','No'],[4,3,2,1,0], inplace=True)
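
The ordinal recoding in the cells above can also be written with an explicit dictionary and `Series.map`, which keeps each answer next to its score. A sketch on a hypothetical series (`answers`), with the scores copied from the 'How often do you get sick?' cell:

```python
import pandas as pd

# Hypothetical answers to 'How often do you get sick?'.
answers = pd.Series(['Very often', 'Sometimes', 'Rarely', 'No Response/Cannot Remember'])

# Ordinal scores copied from the replace() call in the notebook.
sick_scale = {
    'Very often': 4,
    'Sometimes': 3,
    'No Response/Cannot Remember': 2,
    'Rarely': 1,
}

scores = answers.map(sick_scale)   # unmapped answers would become NaN
print(scores.tolist())             # [4, 3, 1, 2]
```

Unlike `replace`, `map` turns any answer missing from the dictionary into NaN, which makes unexpected spellings easy to spot with `isnull()`.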

In [134]:
df4['Does your household own any animals?'].where(df4['Does your household own any animals?']=='No', 'Yes', inplace = True)

In [135]:
df4['Rent or Own Elsewhere'].value_counts()
df4_elsew = df4['Rent or Own Elsewhere']
df4_elsew_encoded = encoder.fit_transform(df4_elsew)

df4_elsew_1hot=hencoder.fit_transform(df4_elsew_encoded.reshape(-1,1))
elsew = pd.DataFrame(df4_elsew_1hot.toarray(), columns = ['Rent or Own Elsewhere No','Rent or Own Elsewhere Yes'])

In [136]:
print(elsew.shape, df4_elsew.shape, df4.shape)

(526, 2) (526,) (526, 62)


In [137]:
df4['Education - School Attendance'].unique()
df4_ed = df4['Education - School Attendance']
df4_ed_encoded = encoder.fit_transform(df4_ed)

df4_ed_1hot=hencoder.fit_transform(df4_ed_encoded.reshape(-1,1))
ed = pd.DataFrame(df4_ed_1hot.toarray(), columns = ['Kids Education No','Kids Education No Response','Kids Education Some','Kids Education Yes'])

In [138]:
df4_child = df4['Children Living Elsewhere']
df4_child_encoded = encoder.fit_transform(df4_child)

df4_child_1hot=hencoder.fit_transform(df4_child_encoded.reshape(-1,1))
dfchil = pd.DataFrame(df4_child_1hot.toarray(), columns = ['Children Living Elsewhere No','Children Living Elsewhere Yes'])


In [139]:
df4['Ownership'].unique()

df4_own = df4['Ownership']
df4_own_encoded = encoder.fit_transform(df4_own)

df4_own_1hot=hencoder.fit_transform(df4_own_encoded.reshape(-1,1))
dfown = pd.DataFrame(df4_own_1hot.toarray(), columns = ['Ownership Other','Ownership Yes','Ownership Rent'])

In [140]:
df4['Previous Ownership'].unique()
df4_pown = df4['Previous Ownership']
df4_pown_encoded = encoder.fit_transform(df4_pown)

df4_pown_1hot=hencoder.fit_transform(df4_pown_encoded.reshape(-1,1))
dfpown = pd.DataFrame(df4_pown_1hot.toarray(), columns = ['Previous Ownership Live with Family','Previous Ownership No Response','Previous Ownership Other','Previous Ownership Own House','Previous Ownership Rent'])

In [141]:
df4['Marital Status'].unique()
df4_marry = df4['Marital Status']
df4_marry_encoded = encoder.fit_transform(df4_marry)

df4_marry_1hot=hencoder.fit_transform(df4_marry_encoded.reshape(-1,1))
dfmarry = pd.DataFrame(df4_marry_1hot.toarray(), columns = ['Marital Status Common Law','Marital Status Married','Marital Status Other','Marital Status Single','Marital Status Widow'])

In [142]:
df4['In the past year, did someone in this home suffer from cough, congestion or similar problems?'].unique()
df4.loc[df4['In the past year, did someone in this home suffer from cough, congestion or similar problems?'].isnull(),'In the past year, did someone in this home suffer from cough, congestion or similar problems?']='No'
df4_cough = df4['In the past year, did someone in this home suffer from cough, congestion or similar problems?']
df4_cough_encoded = encoder.fit_transform(df4_cough)

df4_cough_1hot=hencoder.fit_transform(df4_cough_encoded.reshape(-1,1))
dfcough = pd.DataFrame(df4_cough_1hot.toarray(), columns = ['Suffer from cough or congestion? No','Suffer from cough or congestion? Yes'])
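
The cell above treats a missing answer as 'No' before encoding. A minimal sketch of the same imputation with `fillna`, which also records how many rows were affected, could look like this. The short column name `cough` and the toy data are assumptions used only to keep the example readable.

```python
import numpy as np
import pandas as pd

col = 'cough'  # hypothetical short name for the long survey question
toy = pd.DataFrame({col: ['Yes', np.nan, 'No', np.nan]})

# Count the missing answers first so the imputation is visible in the notebook.
n_missing = int(toy[col].isnull().sum())

# Treat a missing answer as 'No', as the cell above does with .loc.
toy[col] = toy[col].fillna('No')
print(n_missing, toy[col].tolist())
```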

In [143]:
df4['During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?'].value_counts()
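# Assign the result back: inplace=True on a column selection may not update df4 in newer pandas versions.
df4['During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?'] = df4['During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?'].replace('No Response/Cannot Remember','No')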
df4_pneumonia = df4['During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?']
df4_pneumonia_encoded = encoder.fit_transform(df4_pneumonia)

df4_pneumonia_1hot=hencoder.fit_transform(df4_pneumonia_encoded.reshape(-1,1))
dfpneumonia = pd.DataFrame(df4_pneumonia_1hot.toarray(), columns = ['Suffer from bronchitis or pneumonia? No','Suffer from bronchitis or pneumonia? Yes'])

In [144]:
df4['In the past month, did anyone living in the tent suffer from diarrhea?'].unique()
df4_diarrhea = df4['In the past month, did anyone living in the tent suffer from diarrhea?']
df4_diarrhea_encoded = encoder.fit_transform(df4_diarrhea)

df4_diarrhea_1hot=hencoder.fit_transform(df4_diarrhea_encoded.reshape(-1,1))
diarrhea = pd.DataFrame(df4_diarrhea_1hot.toarray(), columns = ['Suffer from diarrhea? No','Suffer from diarrhea? No Response','Suffer from diarrhea? Yes'])

In [145]:
df4['Do you have access to a latrine?'].unique()
df4_lat = df4['Do you have access to a latrine?']
df4_lat_encoded = encoder.fit_transform(df4_lat)

df4_lat_1hot=hencoder.fit_transform(df4_lat_encoded.reshape(-1,1))
lat = pd.DataFrame(df4_lat_1hot.toarray(), columns = ['Do you have access to a latrine? No','Do you have access to a latrine? Yes'])

In [146]:
df4['Do you have electricity in your tent?'].unique()
df4_elec = df4['Do you have electricity in your tent?']
df4_elec_encoded = encoder.fit_transform(df4_elec)

df4_elec_1hot=hencoder.fit_transform(df4_elec_encoded.reshape(-1,1))
elec = pd.DataFrame(df4_elec_1hot.toarray(), columns = ['Do you have electricity in your tent? No','Do you have electricity in your tent? Yes'])

In [147]:
df4['What is the main source of drinking water for members of your household?'].value_counts()
# Translate the Creole answers and merge them into three English categories.
df4['What is the main source of drinking water for members of your household?'] = df4['What is the main source of drinking water for members of your household?'].replace(['Ponp oubyen Pi','Buy from the water truck','Achte nan machine dlo','Ponp oubyen Pi,Achte','Ponp oubyen Pi,Achte nan machine dlo'],['Pump or well','Buy','Buy','Pump, well, or buy','Pump, well, or buy'])
df4_water = df4['What is the main source of drinking water for members of your household?']
df4_water_encoded = encoder.fit_transform(df4_water)

df4_water_1hot=hencoder.fit_transform(df4_water_encoded.reshape(-1,1))
water = pd.DataFrame(df4_water_1hot.toarray(), columns = ['Main source of drinking water:Buy','Main source of drinking water:No','Main source of drinking water:Pump or well','Main source of drinking water:Pump, well, or buy'])
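
The recoding of the drinking-water answers pairs two parallel lists, which is easy to misalign. The sketch below expresses the same mapping as a dictionary and lists any values the mapping leaves untouched. The variable names `water_map`, `recoded` and `unmapped` and the toy Series are assumptions; the answer strings are taken from the cell above.

```python
import pandas as pd

# Same old -> new pairs as in the cell above, written as one mapping.
water_map = {
    'Ponp oubyen Pi': 'Pump or well',
    'Buy from the water truck': 'Buy',
    'Achte nan machine dlo': 'Buy',
    'Ponp oubyen Pi,Achte': 'Pump, well, or buy',
    'Ponp oubyen Pi,Achte nan machine dlo': 'Pump, well, or buy',
}

# Toy answers (made up) standing in for the survey column.
toy = pd.Series(['Ponp oubyen Pi', 'Achte nan machine dlo', 'No', 'Ponp oubyen Pi,Achte'])
recoded = toy.replace(water_map)

# Values that the mapping did not produce; worth reviewing before encoding.
unmapped = sorted(set(recoded) - set(water_map.values()))
print(recoded.tolist(), unmapped)
```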

In [148]:
df4['Do you ever drink water that isn\'t treated?'].value_counts()
df4_treat = df4['Do you ever drink water that isn\'t treated?']
df4_treat_encoded = encoder.fit_transform(df4_treat)

df4_treat_1hot=hencoder.fit_transform(df4_treat_encoded.reshape(-1,1))
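# NOTE: LabelEncoder sorts classes alphabetically, so 'No' would normally come before 'No Response' (as in the other cells).
# Check encoder.classes_ to confirm that the labels below match the encoded order.
treat = pd.DataFrame(df4_treat_1hot.toarray(), columns = ['Do you ever drink water that isn\'t treated? No Response','Do you ever drink water that isn\'t treated? No','Do you ever drink water that isn\'t treated? Yes'])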
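
Because the one-hot column names are typed by hand, they only line up with the data if they follow the order in which `LabelEncoder` numbers the classes. The sketch below shows that order on toy answers and builds the names from `classes_` instead. The answer strings are an assumption about what this survey column contains.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed answer values; the real column may differ.
answers = ['Yes', 'No Response/Cannot Remember', 'No', 'Yes']

enc = LabelEncoder()
codes = enc.fit_transform(answers)

# classes_ is sorted, so code 0 is 'No' and code 1 is the non-response.
print(list(enc.classes_))

# Column names derived from the fitted classes rather than typed by hand.
question = "Do you ever drink water that isn't treated?"
names = [f"{question} {c}" for c in enc.classes_]
print(names)
```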

In [149]:
df4['Is there any risk that the tent will collapse?'].unique()
df4_risk = df4['Is there any risk that the tent will collapse?']
df4_risk_encoded = encoder.fit_transform(df4_risk)

df4_risk_1hot=hencoder.fit_transform(df4_risk_encoded.reshape(-1,1))
risk = pd.DataFrame(df4_risk_1hot.toarray(), columns = ['Is there any risk that the tent will collapse? No','Is there any risk that the tent will collapse? Yes'])
risk.sample(2)

Unnamed: 0,Is there any risk that the tent will collapse? No,Is there any risk that the tent will collapse? Yes
511,0.0,1.0
391,0.0,1.0


In [150]:
df4['In the past year did someone enter your house to steal something?'].value_counts()
df4_steal = df4['In the past year did someone enter your house to steal something?']
df4_steal_encoded = encoder.fit_transform(df4_steal)

df4_steal_1hot=hencoder.fit_transform(df4_steal_encoded.reshape(-1,1))
steal = pd.DataFrame(df4_steal_1hot.toarray(), columns = ['In the past year did someone enter your house to steal something? No','In the past year did someone enter your house to steal something? Yes'])
steal.sample(2)

Unnamed: 0,In the past year did someone enter your house to steal something? No,In the past year did someone enter your house to steal something? Yes
41,0.0,1.0
40,1.0,0.0


In [151]:
df4['Do you have space to lie down if tired?'].value_counts()

df4_lie = df4['Do you have space to lie down if tired?']
df4_lie_encoded = encoder.fit_transform(df4_lie)

df4_lie_1hot=hencoder.fit_transform(df4_lie_encoded.reshape(-1,1))
lie = pd.DataFrame(df4_lie_1hot.toarray(), columns = ['Do you have space to lie down if tired? No','Do you have space to lie down if tired? Yes'])
lie.sample(2)

Unnamed: 0,Do you have space to lie down if tired? No,Do you have space to lie down if tired? Yes
406,1.0,0.0
189,1.0,0.0


In [152]:
df4['Do people living in the tent have space to keep their personal belongings?'].value_counts()
df4_belong = df4['Do people living in the tent have space to keep their personal belongings?']
df4_belong_encoded = encoder.fit_transform(df4_belong)

df4_belong_1hot=hencoder.fit_transform(df4_belong_encoded.reshape(-1,1))
belong = pd.DataFrame(df4_belong_1hot.toarray(), columns = ['Do people living in the tent have space to keep their personal belongings? No','Do people living in the tent have space to keep their personal belongings? Yes'])
belong.sample(2)

Unnamed: 0,Do people living in the tent have space to keep their personal belongings? No,Do people living in the tent have space to keep their personal belongings? Yes
317,0.0,1.0
325,1.0,0.0


In [153]:
df4['In this tent, if someone wakes up, do they wake up the other people?'].value_counts()
df4['In this tent, if someone wakes up, do they wake up the other people?'] = df4['In this tent, if someone wakes up, do they wake up the other people?'].replace('No Response/Cannot Remember','No')
df4_wake = df4['In this tent, if someone wakes up, do they wake up the other people?']
df4_wake_encoded = encoder.fit_transform(df4_wake)

df4_wake_1hot=hencoder.fit_transform(df4_wake_encoded.reshape(-1,1))
wake = pd.DataFrame(df4_wake_1hot.toarray(), columns = ['In this tent, if someone wakes up, do they wake up the other people? No','In this tent, if someone wakes up, do they wake up the other people? Yes'])
wake.sample(2)

Unnamed: 0,"In this tent, if someone wakes up, do they wake up the other people? No","In this tent, if someone wakes up, do they wake up the other people? Yes"
138,1.0,0.0
153,1.0,0.0


In [154]:
df4['Do children have safe places to study?'].value_counts()
df4['Do children have safe places to study?'] = df4['Do children have safe places to study?'].replace('No Response/Cannot Remember','No')
df4_chilsafe = df4['Do children have safe places to study?']
df4_chilsafe_encoded = encoder.fit_transform(df4_chilsafe)

df4_chilsafe_1hot=hencoder.fit_transform(df4_chilsafe_encoded.reshape(-1,1))
chilsafe = pd.DataFrame(df4_chilsafe_1hot.toarray(), columns = ['Do children have safe places to study? No','Do children have safe places to study? Yes'])
chilsafe.sample(2)

Unnamed: 0,Do children have safe places to study? No,Do children have safe places to study? Yes
85,1.0,0.0
483,1.0,0.0


In [155]:
df4['Does your household own any animals?'].value_counts()
df4_animal = df4['Does your household own any animals?']
df4_animal_encoded = encoder.fit_transform(df4_animal)

df4_animal_1hot=hencoder.fit_transform(df4_animal_encoded.reshape(-1,1))
animal = pd.DataFrame(df4_animal_1hot.toarray(), columns = ['Does your household own any animals? No','Does your household own any animals? Yes'])
animal.sample(2)

Unnamed: 0,Does your household own any animals? No,Does your household own any animals? Yes
214,1.0,0.0
435,1.0,0.0


In [156]:
df4['Does the household own a radio?'].value_counts()
df4_radio = df4['Does the household own a radio?']
df4_radio_encoded = encoder.fit_transform(df4_radio)

df4_radio_1hot=hencoder.fit_transform(df4_radio_encoded.reshape(-1,1))
radio = pd.DataFrame(df4_radio_1hot.toarray(), columns = ['Does your household own a radio? No','Does your household own a radio? Yes'])
radio.sample(2)

Unnamed: 0,Does your household own a radio? No,Does your household own a radio? Yes
252,1.0,0.0
28,0.0,1.0


In [157]:
df4['Do you feel safe in your home?'].value_counts()
df4_home = df4['Do you feel safe in your home?']
df4_home_encoded = encoder.fit_transform(df4_home)

df4_home_1hot=hencoder.fit_transform(df4_home_encoded.reshape(-1,1))
home = pd.DataFrame(df4_home_1hot.toarray(), columns = ['Do you feel safe in your home? No','Do you feel safe in your home? Yes'])
home.sample(2)

Unnamed: 0,Do you feel safe in your home? No,Do you feel safe in your home? Yes
10,0.0,1.0
209,0.0,1.0


In [158]:
df4['Do you feel safe leaving your children alone at home?'].value_counts()
df4_homechil = df4['Do you feel safe leaving your children alone at home?']
df4_homechil_encoded = encoder.fit_transform(df4_homechil)

df4_homechil_1hot=hencoder.fit_transform(df4_homechil_encoded.reshape(-1,1))
homechil = pd.DataFrame(df4_homechil_1hot.toarray(), columns = ['Do you feel safe leaving your children alone at home? No','Do you feel safe leaving your children alone at home? Yes'])
homechil.sample(2)

Unnamed: 0,Do you feel safe leaving your children alone at home? No,Do you feel safe leaving your children alone at home? Yes
319,1.0,0.0
193,1.0,0.0


In [159]:
df4['Do you feel safe walking in the community at night?'].value_counts()
df4_community = df4['Do you feel safe walking in the community at night?']
df4_community_encoded = encoder.fit_transform(df4_community)

df4_community_1hot=hencoder.fit_transform(df4_community_encoded.reshape(-1,1))
community = pd.DataFrame(df4_community_1hot.toarray(), columns = ['Do you feel safe walking in the community at night? No','Do you feel safe walking in the community at night? Yes'])
community.sample(2)

Unnamed: 0,Do you feel safe walking in the community at night? No,Do you feel safe walking in the community at night? Yes
332,1.0,0.0
115,0.0,1.0


In [160]:
df4['Do you own the land the tent is on?'].value_counts()
df4_land = df4['Do you own the land the tent is on?']
df4_land_encoded = encoder.fit_transform(df4_land)

df4_land_1hot=hencoder.fit_transform(df4_land_encoded.reshape(-1,1))
land = pd.DataFrame(df4_land_1hot.toarray(), columns = ['Do you own the land the tent is on? No','Do you own the land the tent is on? Yes'])
land.sample(2)

Unnamed: 0,Do you own the land the tent is on? No,Do you own the land the tent is on? Yes
373,1.0,0.0
26,0.0,1.0


In [161]:
df4['If you were to receive a house, would you be willing to move to the area behind Healing Haiti?'].value_counts()
df4['If you were to receive a house, would you be willing to move to the area behind Healing Haiti?'] = df4['If you were to receive a house, would you be willing to move to the area behind Healing Haiti?'].replace('No Response/Cannot Remember','No')
df4_move = df4['If you were to receive a house, would you be willing to move to the area behind Healing Haiti?']
df4_move_encoded = encoder.fit_transform(df4_move)

df4_move_1hot=hencoder.fit_transform(df4_move_encoded.reshape(-1,1))
move = pd.DataFrame(df4_move_1hot.toarray(), columns = ['If you were to receive a house, would you be willing to move to the area behind Healing Haiti? No','If you were to receive a house, would you be willing to move to the area behind Healing Haiti? Yes'])
move.sample(2)

Unnamed: 0,"If you were to receive a house, would you be willing to move to the area behind Healing Haiti? No","If you were to receive a house, would you be willing to move to the area behind Healing Haiti? Yes"
498,0.0,1.0
323,0.0,1.0


In [162]:
df4['Do you feel this person qualifies for a home?'].value_counts()
df4_qual = df4['Do you feel this person qualifies for a home?']
df4_qual_encoded = encoder.fit_transform(df4_qual)

df4_qual_1hot=hencoder.fit_transform(df4_qual_encoded.reshape(-1,1))
qual = pd.DataFrame(df4_qual_1hot.toarray(), columns = ['Do you feel this person qualifies for a home? No','Do you feel this person qualifies for a home? Yes'])
qual.sample(2)

Unnamed: 0,Do you feel this person qualifies for a home? No,Do you feel this person qualifies for a home? Yes
405,0.0,1.0
9,0.0,1.0


In [163]:
df4['What are the dwellings floors made of?'].value_counts()
# Dirt or soil floors are the worst, so all other floor types are merged into one category.
df4['What are the dwellings floors made of?'] = df4['What are the dwellings floors made of?'].replace(['Tile ceramic or wood planks','Concrete','Other'],'Ceramic, wood, concrete')
df4_floor = df4['What are the dwellings floors made of?']
df4_floor_encoded = encoder.fit_transform(df4_floor)

df4_floor_1hot=hencoder.fit_transform(df4_floor_encoded.reshape(-1,1))
floor = pd.DataFrame(df4_floor_1hot.toarray(), columns = ['What are the dwellings floors made of? Ceramic, wood, concrete','What are the dwellings floors made of? Dirt or soil'])
floor.sample(2)

Unnamed: 0,"What are the dwellings floors made of? Ceramic, wood, concrete",What are the dwellings floors made of? Dirt or soil
161,0.0,1.0
86,0.0,1.0


In [164]:
df4['What are the dwellings roof made of?'].value_counts()
# Tarp roof is the worst
df4_roof = df4['What are the dwellings roof made of?']
df4_roof_encoded = encoder.fit_transform(df4_roof)

df4_roof_1hot=hencoder.fit_transform(df4_roof_encoded.reshape(-1,1))
roof = pd.DataFrame(df4_roof_1hot.toarray(), columns = ['What are the dwellings roof made of? Tarp','What are the dwellings roof made of? Tin'])
roof.sample(2)

Unnamed: 0,What are the dwellings roof made of? Tarp,What are the dwellings roof made of? Tin
344,0.0,1.0
154,0.0,1.0


In [165]:
df4['Underage kids in the family %']= df4['# of Residents less than 18 yr.']/df4['# of Tent Residents']
df4['Water Usage (gallon) per resident per day'] = df4['How many gallons of water does your household use per day?']/df4['# of Tent Residents']
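
Both derived features divide by the number of tent residents. As a hedged sketch, the example below guards against a zero resident count by turning it into NaN before dividing, so the ratio stays missing instead of becoming infinite. The toy rows are made up, and whether the survey contains zero counts at all is not known from the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    '# of Tent Residents': [4, 2, 0],
    '# of Residents less than 18 yr.': [1, 1, 0],
})

# Replace a zero denominator with NaN so the ratio is missing, not inf.
residents = toy['# of Tent Residents'].replace(0, np.nan)
toy['Underage kids in the family %'] = toy['# of Residents less than 18 yr.'] / residents
print(toy['Underage kids in the family %'].tolist())
```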

In [166]:
df4.shape

(526, 64)

In [167]:
onehot= pd.concat([dfchil, elsew, ed, dfown, dfpown, dfmarry, dfcough, dfpneumonia, diarrhea, lat, elec, water, treat, risk, steal, lie, belong, wake, chilsafe, animal, radio, home, homechil, community, land, move, qual, floor, roof], axis=1)
onehot.sample(5)
onehot.shape

(526, 71)
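
Before relying on the concatenated frame, it can help to check that every one-hot block has exactly one 1.0 per row and that no columns were lost in the concat. The sketch below does this on two toy blocks; the dictionary `blocks` and its contents are assumptions standing in for the 29 frames concatenated above.

```python
import pandas as pd

# Two toy one-hot blocks standing in for lat, elec, etc.
lat_demo = pd.DataFrame({'latrine No': [1.0, 0.0, 0.0], 'latrine Yes': [0.0, 1.0, 1.0]})
elec_demo = pd.DataFrame({'electricity No': [1.0, 1.0, 0.0], 'electricity Yes': [0.0, 0.0, 1.0]})
blocks = {'lat': lat_demo, 'elec': elec_demo}

# Each row of a one-hot block should contain exactly one 1.0.
for name, block in blocks.items():
    assert (block.sum(axis=1) == 1).all(), f"{name} is not a valid one-hot block"

onehot_demo = pd.concat(list(blocks.values()), axis=1)

# The concatenated width should equal the sum of the block widths.
expected_cols = sum(block.shape[1] for block in blocks.values())
print(onehot_demo.shape, expected_cols)
```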

In [168]:
onehot.isnull().sum()

Children Living Elsewhere No                                                                          0
Children Living Elsewhere Yes                                                                         0
Rent or Own Elsewhere No                                                                              0
Rent or Own Elsewhere Yes                                                                             0
Kids Education No                                                                                     0
Kids Education No Response                                                                            0
Kids Education Some                                                                                   0
Kids Education Yes                                                                                    0
Ownership Other                                                                                       0
Ownership Yes                                                   

In [169]:
df5 = df4.drop(['# of Residents less than 18 yr.','How many gallons of water does your household use per day?','Children Living Elsewhere','Rent or Own Elsewhere','Ownership','Previous Ownership','Marital Status','In the past year, did someone in this home suffer from cough, congestion or similar problems?','During the last year, anyone living in the tent suffer from bronchitis, bronchiolitis or pneumonia?','In the past month, did anyone living in the tent suffer from diarrhea?','Do you have access to a latrine?','Do you have electricity in your tent?','What is the main source of drinking water for members of your household?','Do you ever drink water that isn\'t treated?','Is there any risk that the tent will collapse?','In the past year did someone enter your house to steal something?','If you leave your tent are you concerned that someone ll steal from you?','Do you have space to lie down if tired?','In this tent, if someone wakes up, do they wake up the other people?','Do people living in the tent have space to keep their personal belongings?','Do children have safe places to study?','Does your household own any animals?','Does the household own a radio?','Do you feel safe in your home?','Do you feel safe leaving your children alone at home?','Do you feel safe walking in the community at night?','Do you own the land the tent is on?','If you were to receive a house, would you be willing to move to the area behind Healing Haiti?','Do you feel this person qualifies for a home?','What are the dwellings floors made of?','What are the dwellings roof made of?','Education - School Attendance'],axis=1)

In [170]:
df5.sample(2)
df5.isnull().sum()

Tent ID                                                                              0
# of Tent Residents                                                                  0
# of Years Living in Village                                                         0
# of Years Living in Tent                                                            0
Family Bacgkround                                                                    0
Problems in the Tent - Additional Comments                                           0
How often do you get sick?                                                           0
Additional Comments - Health                                                         0
Do you have any other comments, questions or other information you’d like to add?    0
How often do you have friends, family or neighborhoods over to your tent?            0
Would living in a block home create any changes in your life?                        0
HH1 Female                                 

In [171]:
df5.shape

(526, 32)

In [172]:
df5.index

RangeIndex(start=0, stop=526, step=1)

In [173]:
onehot.index

RangeIndex(start=0, stop=526, step=1)

In [174]:
dataset = df5.join(onehot, how='outer')
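
The join works here because `df5` and `onehot` share the same RangeIndex, as the two previous cells show. The sketch below illustrates, on toy frames, why that check matters: with `how='outer'`, an index that is missing on one side produces a row of NaN values rather than an error. All names and values in it are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'Tent ID': [90, 557, 238]})
right = pd.DataFrame({'flag No': [1.0, 0.0, 1.0]})

# Aligned indexes: the join is a plain side-by-side merge.
assert left.index.equals(right.index)
joined = left.join(right, how='outer')

# Misaligned indexes: row 1 is absent on the left, so 'Tent ID' becomes NaN there.
left_filtered = left.iloc[[0, 2]]
joined_bad = left_filtered.join(right, how='outer')
print(joined.shape, joined_bad['Tent ID'].isnull().sum())
```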

In [175]:
dataset.sample(5)

Unnamed: 0,Tent ID,# of Tent Residents,# of Years Living in Village,# of Years Living in Tent,Family Bacgkround,Problems in the Tent - Additional Comments,How often do you get sick?,Additional Comments - Health,"Do you have any other comments, questions or other information you’d like to add?","How often do you have friends, family or neighborhoods over to your tent?",Would living in a block home create any changes in your life?,HH1 Female,HH1 Male,HH1 Agriculture/Fish,HH1 Contracted Worker,HH1 Driver,HH1 Family Provides,HH1 Laundry / Servant,HH1 Laundry/Housekeeper,HH1 Lesiv/Servant,HH1 Nothing,HH1 Other,HH1 Paid Consistent Job,HH1 Small business outside or nearby the home,HH1 Student,HH1 Vendor,Sleep Length,Number of Residents with Income,Do you normally wake up at night?,Do you ever have trouble staying awake during the day?,Underage kids in the family %,Water Usage (gallon) per resident per day,Children Living Elsewhere No,Children Living Elsewhere Yes,Rent or Own Elsewhere No,Rent or Own Elsewhere Yes,Kids Education No,Kids Education No Response,Kids Education Some,Kids Education Yes,Ownership Other,Ownership Yes,Ownership Rent,Previous Ownership Live with Family,Previous Ownership No Response,Previous Ownership Other,Previous Ownership Own House,Previous Ownership Rent,Marital Status Common Law,Marital Status Married,Marital Status Other,Marital Status Single,Marital Status Widow,Suffer from cough or congestion? No,Suffer from cough or congestion? Yes,Suffer from bronchitis or pneumonia? No,Suffer from bronchitis or pneumonia? Yes,Suffer from diarrhea? No,Suffer from diarrhea? No Response,Suffer from diarrhea? Yes,Do you have access to a latrine? No,Do you have access to a latrine? Yes,Do you have electricity in your tent? No,Do you have electricity in your tent? Yes,Main source of drinking water:Buy,Main source of drinking water:No,Main source of drinking water:Pump or well,"Main source of drinking water:Pump, well, or buy",Do you ever drink water that isn't treated? No Response,Do you ever drink water that isn't treated? No,Do you ever drink water that isn't treated? Yes,Is there any risk that the tent will collapse? No,Is there any risk that the tent will collapse? Yes,In the past year did someone enter your house to steal something? No,In the past year did someone enter your house to steal something? Yes,Do you have space to lie down if tired? No,Do you have space to lie down if tired? Yes,Do people living in the tent have space to keep their personal belongings? No,Do people living in the tent have space to keep their personal belongings? Yes,"In this tent, if someone wakes up, do they wake up the other people? No","In this tent, if someone wakes up, do they wake up the other people? Yes",Do children have safe places to study? No,Do children have safe places to study? Yes,Does your household own any animals? No,Does your household own any animals? Yes,Does your household own a radio? No,Does your household own a radio? Yes,Do you feel safe in your home? No,Do you feel safe in your home? Yes,Do you feel safe leaving your children alone at home? No,Do you feel safe leaving your children alone at home? Yes,Do you feel safe walking in the community at night? No,Do you feel safe walking in the community at night? Yes,Do you own the land the tent is on? No,Do you own the land the tent is on? Yes,"If you were to receive a house, would you be willing to move to the area behind Healing Haiti? No","If you were to receive a house, would you be willing to move to the area behind Healing Haiti? Yes",Do you feel this person qualifies for a home? No,Do you feel this person qualifies for a home? Yes,"What are the dwellings floors made of? Ceramic, wood, concrete",What are the dwellings floors made of? Dirt or soil,What are the dwellings roof made of? Tarp,What are the dwellings roof made of? Tin
417,90,4,26,7.0,Audio,Audio,3,Tifoid ak tet femal ak fyèv,Gen pafwa,2,Audio,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,9.0,1.0,0,3,0.25,5.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
314,557,2,7,7.0,I came here to go to school in 2010 my daughter was 8 months old at the time. But it didn't work out. I am here now trying to make a living.,Water gets inside when it rains. And during hurricanes seasons it gets rough,3,My daughter gets sick sometimes. I don't have a lot t take care of here. She's not eating right and driving unclean water.,No more comments.,2,"Oh! That would please me. That would make me very happy. If God was to remove me from this situation…. Very soon rain is going to fall, and I have nowhere to shield myself from it. If God could build me a house I would be very happy.",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0,0,0,0.5,2.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
67,238,4,12,7.0,"My parents left where they were out in the country to come here so that the kids could go to school, because where they were wasn't a great place.\n","When it rains, I get wet. When it's windy, we can't stay inside because the tarp is beating ""bow bow"". Also, animals sometimes come inside because there's nothing on the ground to stop them.",3,Colds and the flu and headaches,That's what we live on because we don't have money to buy water.,2,"The advantage I'll get is that I won't get wet. I'll live comfortably. When I put my things there they won't get lost. When I go out, I won't be stressed out because where I'm living, people won't be able to come in easily.",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0,3,0.5,5.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
281,532,2,7,7.0,I moved here because of the earthquake. Life was difficult in town after the earthquake so we moved here.,Water gets in during the rain and it's hot. Not safe during hurricane season.,1,We are healthy thanks to God.,Nothing more.,2,"With my child we have no one else helping us. It would change our lives. And I'd be really thankful\nAudio: The problems I have along with my child… [inaudible, lots of wind]",1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0,0,0,0.5,2.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
274,511,4,17,7.0,Men tevin lekòl Mwen tou rete Isi,Audio,3,Fyèv ak grip,Selim bwè,2,Audio,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,10.0,0.0,0,3,0.75,6.25,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0


In [176]:
## Introducing the pairwise interaction features below did not improve clustering performance
#dataset['Male HH Unemployed'] = dataset['HH1 Male']*dataset['HH1 Nothing']
#dataset['Male HH Steady Job'] = dataset['HH1 Paid Consistent Job']* dataset['HH1 Male']
#dataset['Children Living Rented or Owned House Elsewhere'] = dataset['Children Living Elsewhere Yes']*dataset['Rent or Own Elsewhere Yes']
#dataset['Untreated-Water-caused diarrhea'] = dataset['Suffer from diarrhea? Yes']* dataset['Do you ever drink water that isn\'t treated? Yes']
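
For readers unfamiliar with the idea, a pairwise interaction between two 0/1 indicator columns is just their elementwise product, which is 1.0 only when both conditions hold. The toy frame below is a made-up stand-in for `dataset`, using two of the column names from the commented-out lines.

```python
import pandas as pd

toy = pd.DataFrame({
    'HH1 Male': [1.0, 0.0, 1.0, 0.0],
    'HH1 Nothing': [1.0, 1.0, 0.0, 0.0],
})

# The product is 1.0 only for rows where both indicators are 1.0.
toy['Male HH Unemployed'] = toy['HH1 Male'] * toy['HH1 Nothing']
print(toy['Male HH Unemployed'].tolist())
```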

In [177]:
dataset.isnull().sum()

Tent ID                                                                                               0
# of Tent Residents                                                                                   0
# of Years Living in Village                                                                          0
# of Years Living in Tent                                                                             0
Family Bacgkround                                                                                     0
Problems in the Tent - Additional Comments                                                            0
How often do you get sick?                                                                            0
Additional Comments - Health                                                                          0
Do you have any other comments, questions or other information you’d like to add?                     0
How often do you have friends, family or neighborhoods over to y

In [178]:
dataset.shape

(526, 103)

In [179]:
dataset.to_csv('/Users/satokosuda/dataforcause/new_story_data/dataset.csv', index=False)

In [180]:
# There are 5 columns of text data, each describing a different aspect of the difficulty of living in tents, so they cannot be combined.
# Each text column needs to be processed separately.
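
The methodology at the top of the notebook describes clustering each text column into 3 segments and storing a segment id from 1 to 3. One possible way to do that is sketched below with TF-IDF features followed by k-means. The choice of TF-IDF, the helper name `segment_text`, and the toy answers are assumptions; the notebook's own text processing may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def segment_text(series: pd.Series, n_segments: int = 3, seed: int = 0) -> pd.Series:
    """Cluster one free-text column and return segment ids 1..n_segments."""
    # Turn each answer into a TF-IDF vector (assumed representation).
    features = TfidfVectorizer().fit_transform(series.fillna(''))
    km = KMeans(n_clusters=n_segments, n_init=10, random_state=seed)
    # k-means labels start at 0; shift so ids run from 1 to n_segments.
    return pd.Series(km.fit_predict(features) + 1, index=series.index)

# Toy answers (made up) standing in for one of the five text columns.
toy_text = pd.Series([
    'water gets inside when it rains',
    'rain comes in and the tarp leaks',
    'fever and flu',
    'headaches and fever',
    'we moved here after the earthquake',
    'came here so the kids could go to school',
])
segments = segment_text(toy_text)
print(segments.tolist())
```

Each text column would get its own call, and the resulting segment ids can then be added to the feature table as one column per text field.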