# Fragile Families Challenge

## Introduction

Verbatim from the Fragile Families Challenge website:

- The Fragile Families & Child Wellbeing Study is following a cohort of nearly 5,000 children born in large U.S. cities between 1998 and 2000, roughly three-quarters of whom were born to unmarried parents. We refer to unmarried parents and their children as “fragile families” to underscore that they are families and that they are at greater risk of breaking up and living in poverty than more traditional families.

- The core Study was originally designed to primarily address four questions of great interest to researchers and policy makers: (1) What are the conditions and capabilities of unmarried parents, especially fathers?; (2) What is the nature of the relationships between unmarried parents?; (3) How do children born into these families fare?; and (4) How do policies and environmental conditions affect families and children?

- The core Study consists of interviews with both mothers and fathers at birth and again when children are ages one, three, five, and nine. The parent interviews collect information on attitudes, relationships, parenting behavior, demographic characteristics, health (mental and physical), economic and employment status, neighborhood characteristics, and program participation. Additionally, in-home assessments of children and their home environments were conducted at ages three, five, and nine. The in-home interview collects information on children’s cognitive and emotional development, health, and home environment. Several collaborative studies provide additional information on parents’ medical, employment and incarceration histories, religion, child care and early childhood education. A fifteen-year follow-up wave includes a collection of in-home and telephone survey data from caregivers and teens.

## Problem Description
You are provided with the background data collected from birth through age 9, along with year-15 outcomes for a training subset of the children. The goal of this project is to predict six key outcomes at year 15: grit, GPA, material hardship, eviction, job loss (layoff), and job training. Three of these are continuous (GPA, grit, and material hardship) and three are binary (eviction, layoff of a caregiver, and job training of a caregiver).

## Data Description

The `input_files` folder contains three comma-separated values files, a text codebook file, a Stata .dta file that contains the data with variable and value labels, and a text file identifying features that are constant.

---
- background.csv contains 4,242 rows (one per family) and 13,027 columns.
- background.dta contains the same information, plus variable and value labels, in a Stata data file.

These files contain:

challengeID: A unique numeric identifier for each child.

13,026 background variables collected from birth to age 9, which you may use in building your model.
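Because `background.dta` carries variable labels that the CSV lacks, it can help to load those labels into a lookup table when exploring the 13,026 variables. The following is a minimal sketch, assuming the file is readable by `pandas.read_stata`; the helper name `read_variable_labels` is hypothetical and not part of the Challenge materials.

```python
import pandas as pd

def read_variable_labels(dta_path):
    """Return a dict mapping each column name to its Stata variable label."""
    # iterator=True yields a reader object, so the labels can be
    # inspected without loading every row into memory.
    with pd.read_stata(dta_path, iterator=True) as reader:
        return reader.variable_labels()

# Example usage (path taken from the folder layout described above):
# labels = read_variable_labels('./input_files/background.dta')
# labels.get('cf1lenhr')
```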

---
- train.csv contains 2,121 rows (one per child in the training set) and 7 columns.

These are the outcome variables measured at approximately child age 15, which you can use to train your models.

The file contains:

challengeID: A unique numeric identifier for each child.

Six outcome variables. Blog posts about the outcomes are available at http://www.fragilefamilieschallenge.org/blog-posts/

Continuous variables: grit, gpa, materialHardship

Binary variables: eviction, layoff, jobTraining
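Since `challengeID` appears in both `background.csv` and `train.csv`, the two files can be joined on it to build one modelling table per outcome. The sketch below is one possible way to do that, not the approach the Challenge prescribes. It assumes that an outcome may be missing for some children in `train.csv`, so rows without a value for the chosen outcome are dropped; the function name `build_training_frame` is hypothetical.

```python
import pandas as pd

# Outcome names as listed in the train.csv description.
CONTINUOUS_OUTCOMES = ["grit", "gpa", "materialHardship"]
BINARY_OUTCOMES = ["eviction", "layoff", "jobTraining"]

def build_training_frame(background, train, outcome):
    """Join background features to one outcome, keeping rows where it is observed."""
    # Assumption: outcomes can be missing, so drop rows lacking this one.
    labelled = train[["challengeID", outcome]].dropna(subset=[outcome])
    # validate raises if challengeID is duplicated on either side.
    return labelled.merge(
        background, on="challengeID", how="inner", validate="one_to_one"
    )
```

A continuous outcome would then go to a regression model and a binary one to a classifier, for example `build_training_frame(background_data, train_data, "gpa")`, where `train_data` is assumed to hold the contents of `train.csv`.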

---
- codebook_FFChallenge.txt is a text file that contains the codebook for all variables in the Challenge data file. This combines several codebooks from the main Fragile Families and Child Wellbeing Study documentation.

---
- constantVariables.txt gives the column names of variables that are constant in the data. Some of these variables are constant because they have been redacted out of concern for the privacy of respondents. 

## Solution Approach
As a first step, we remove the constant columns. They carry no information for prediction, and dropping them reduces the amount of computation.


In [1]:
import numpy as np
import pandas as pd

In [3]:
# Read the background data into pandas.
# low_memory=False reads each column in one pass, because certain columns have mixed data types.
BACKGROUND_FILE = './input_files/background.csv'
background_data = pd.read_csv(BACKGROUND_FILE, low_memory=False)
background_data.head()

Unnamed: 0,challengeID,cf1intmon,cf1intyr,cf1lenhr,cf1lenmin,cf1twoc,cf1fint,cf1natsm,f1natwt,cf1natsmx,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
0,1,-3,-3,-9,-9,-3,0,-3,-3,-3,...,6.269946,5.180325,2.511131,1.718804,6.473537,16.369411,4.476881,9.628369,15.981275,24.038266
1,2,-3,-3,0,40,-3,1,-3,-3,-3,...,6.269946,27.680196,2.511131,1.718804,6.473537,16.369411,26.671897,9.628369,15.981275,3.667679
2,3,-3,-3,0,45,-3,1,-3,-3,-3,...,6.269946,5.180325,20.867881,24.115867,6.473537,16.369411,4.476881,9.628369,15.981275,24.038266
3,4,-3,-3,0,45,-3,1,-3,-3,-3,...,6.269946,5.180325,22.018875,22.932641,6.473537,-5.169243,4.476881,9.628369,-6.303171,4.140511
4,5,-3,-3,-6,50,-3,1,-3,-3,-3,...,6.269946,5.180325,22.916602,22.988036,6.473537,-6.03466,4.476881,9.628369,-6.211828,3.668879
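The preview contains many small negative integers such as -3, -6, and -9. In survey data these are often reserved codes for missing or skipped answers, but that is an assumption here and should be confirmed against the codebook before use. Assuming it holds, a sketch like the following converts only those exact codes to `NaN` and leaves other negative values (for example -5.169243) untouched. The names `ASSUMED_MISSING_CODES` and `mask_missing_codes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: these integer codes mark missing or skipped answers.
# Check the codebook before relying on this list.
ASSUMED_MISSING_CODES = [-9, -6, -3]

def mask_missing_codes(df, codes=ASSUMED_MISSING_CODES, id_column="challengeID"):
    """Return a copy of df with the given codes replaced by NaN in numeric columns."""
    # Only numeric columns are touched, and the identifier is never masked.
    numeric_cols = df.select_dtypes(include="number").columns.drop(
        id_column, errors="ignore"
    )
    out = df.copy()
    out[numeric_cols] = out[numeric_cols].mask(out[numeric_cols].isin(codes))
    return out
```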


In [6]:
# Read the names of the constant columns listed in constantVariables.txt, then drop those columns
CONSTANT_VARIABLES_FILE = './input_files/constantVariables.txt'
constant_vars = np.loadtxt(CONSTANT_VARIABLES_FILE, dtype=str).flatten()
background_data = background_data.drop(columns=constant_vars)
background_data.head()

Unnamed: 0,challengeID,cf1lenhr,cf1lenmin,cf1fint,cf1citsm,f1citywt,f1a2,f1a3,f1a4,f1a4a,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
0,1,-9,-9,0,-9,-3.0,-9,-9,-9,-9,...,6.269946,5.180325,2.511131,1.718804,6.473537,16.369411,4.476881,9.628369,15.981275,24.038266
1,2,0,40,1,1,68.455658,2,1,1,-6,...,6.269946,27.680196,2.511131,1.718804,6.473537,16.369411,26.671897,9.628369,15.981275,3.667679
2,3,0,45,1,1,42.319057,1,1,1,-6,...,6.269946,5.180325,20.867881,24.115867,6.473537,16.369411,4.476881,9.628369,15.981275,24.038266
3,4,0,45,1,1,25.62883,1,1,1,-6,...,6.269946,5.180325,22.018875,22.932641,6.473537,-5.169243,4.476881,9.628369,-6.303171,4.140511
4,5,-6,50,1,1,41.954487,2,1,1,-6,...,6.269946,5.180325,22.916602,22.988036,6.473537,-6.03466,4.476881,9.628369,-6.211828,3.668879
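As a sanity check on the drop, constant columns can also be detected directly from the data rather than taken from the supplied list. The sketch below is one way to do this, assuming "constant" means at most one distinct value with `NaN` counted as a value; `find_constant_columns` is a hypothetical helper name.

```python
import pandas as pd

def find_constant_columns(df):
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is constant too.
    distinct = df.nunique(dropna=False)
    return distinct[distinct <= 1].index.tolist()
```

Calling `find_constant_columns(background_data)` after the drop should return an empty list if the supplied file covered every constant column. This was not run against the Challenge data, so no result is claimed here.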
