# Activity 3.2 – Recoding and Aggregating a Health Care Survey

## Part 1 – Three ways to recode

Before we get to the main event, let’s practice recoding the survey
data in three ways. In all cases, our goal is to assign a score of `5` to
`"Strongly Agree"`, down to a score of `1` for `"Strongly Disagree"`.

| Old Label                    | Regular Coded Value |
|------------------------------|---------------------|
| `"Strongly Disagree"`          | 1                   |
| `"Somewhat Disagree"`          | 2                   |
| `"Neither Agree nor Disagree"` | 3                   |
| `"Somewhat Agree"`             | 4                   |
| `"Strongly Agree"`             | 5                   |

**Preparation.** Recall that the `more_dfply.recode` function allows us to
recode a column using a `dict`.

1.  Load the health survey data found in the data folder.  Inspect the column names, then fix the issues with the `"."`s.  **Hint.** We can use `dfply.rename` with the `dict` unpacking trick.

In [1]:
# Your code here
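As a rough sketch of the renaming idea, here is one way to do it in plain pandas. The toy frame and its column names (`ID`, `F1`, `F1.1`) are assumptions standing in for the real survey file.

```python
import pandas as pd

# Toy stand-in for the survey; the real file has many more columns.
survey = pd.DataFrame({
    "ID": [1, 2],
    "F1": ["Strongly Agree", "Somewhat Agree"],
    "F1.1": ["Somewhat Disagree", "Strongly Agree"],
})

# Build a dict mapping each old name to a new name with "." replaced by "_".
new_names = {old: old.replace(".", "_") for old in survey.columns}
survey_clean = survey.rename(columns=new_names)

print(list(survey_clean.columns))
```

If you use `dfply.rename` instead, my understanding is that it expects `new_name=old_name` keyword arguments, so the `dict` you unpack with `**` would map new names to old ones.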

2.  Use `unique` to verify the labels of various columns. Create a `dict` mapping each survey response to the corresponding score.

In [7]:
# Your code here

3.  Test your dictionary with the `map` method on one of the columns.  The columns `F2` and `F1.1` are good test cases.

In [11]:
# Your code here

4. Explain why you should be using the `pandas` `Int64` data type here.  Details on the necessity and use of this data type can be found in [Lecture 2.4](https://github.com/wsu-stat489/module2_intro_to_pandas/blob/main/2_4_pandas_dtypes.ipynb).

<font color="blue"> *Your thoughts here* </font>
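To see how `map` behaves on a column with a missing response, here is a minimal sketch on a made-up `Series` (the data is invented for illustration). It shows the mapping step followed by a conversion to the nullable `Int64` type.

```python
import pandas as pd

scores = {
    "Strongly Disagree": 1,
    "Somewhat Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Somewhat Agree": 4,
    "Strongly Agree": 5,
}

# Toy column with one missing response.
responses = pd.Series(["Strongly Agree", None, "Somewhat Disagree"])

# map looks each label up in the dict; the missing response has no match.
mapped = responses.map(scores)

# Int64 (capital I) is pandas' nullable integer type, so the scores stay
# integers and the missing value is kept as <NA>.
as_int = mapped.astype("Int64")

print(mapped.dtype, as_int.dtype)
```

Compare the two printed data types when you run this, and think about what that means for a survey where some respondents skipped questions.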

#### Method 1 – Brute Force. 

The naïve approach to applying our mapping
is to construct a `mutate`, writing one line per column. This will become
annoying.

1.  Create a pipe that uses `mutate` to transform at least 10 of the columns.

In [1]:
# Your code here

2.  Explain,

    A.  in vivid detail, exactly how annoying it would be to continue
        this process.

    

<font color="blue"> *Your thoughts here* </font>

    B.  how this might be prone to buggy code.

<font color="blue"> *Your thoughts here* </font>

#### Method 2 – Dictionary Unpacking.

Recall that we cleaned up repeated,
similar transformations in [Lecture 3.5](https://github.com/wsu-stat489/module3_more_about_mutate/blob/main/3_5_DRY_and_many_transformations.ipynb). Let’s apply that approach here.

1.  Get a `list` of all the question columns, e.g., using `dfply.columns_from`.

In [9]:
# Your code here

2.  Pick one of the columns and write the expression to transform that
    column. Be sure to use the `df["col_string"]` method of referencing
    the column.

In [1]:
# Your code here

3.  Create a variable to hold this column string. Replace the hard-coded
    column name with the variable. Rerun to test.

In [1]:
# Your code here

4.  Convert the single expression to a `dict` using a comprehension
    that iterates over all column names. We want the keys to be the
    column names and the values to be the resulting recoded columns.  Rerun to
    test. Clean up the code by packaging the complexity in `lambda` functions.

In [1]:
# Your code here

5.  Use our dictionary in a `mutate` using `**` unpacking.

In [1]:
# Your code here
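The overall shape of Method 2 can be sketched in plain pandas, with `DataFrame.assign` standing in for `mutate`. The toy frame, the `recode` helper, and the way the question columns are selected are all assumptions for illustration, not the course's `dfply` solution.

```python
import pandas as pd

scores = {
    "Strongly Disagree": 1,
    "Somewhat Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Somewhat Agree": 4,
    "Strongly Agree": 5,
}

# Toy stand-in for the survey data.
survey = pd.DataFrame({
    "ID": [1, 2],
    "F1": ["Strongly Agree", "Somewhat Agree"],
    "F2": ["Somewhat Disagree", "Neither Agree nor Disagree"],
})

# Assumption: question columns are the ones whose names start with "F".
question_cols = [c for c in survey.columns if c.startswith("F")]

# Package the recoding in a lambda so the comprehension stays readable.
recode = lambda col: col.map(scores).astype("Int64")

# Keys are column names; values are the recoded columns.
recoded = {name: recode(survey[name]) for name in question_cols}

# ** unpacking turns each key/value pair into a keyword argument.
result = survey.assign(**recoded)
print(result)
```

Because the comprehension is driven by `question_cols`, adding or removing a question changes one list rather than one line of code per column.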

#### Method 3 – Stack Transform Unstack. 

Another method for applying the
same transformation to multiple columns is to (A) stack all the columns
that need transformation, (B) apply the transformation to the stacked
values column, and (C) unstack the data back into the original shape.
This is the approach you will apply in **Part 2**.
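The three steps (A), (B), and (C) might look like the following in plain pandas, using `melt` to stack and `pivot` to unstack. The toy data and the column names `Question`, `Response`, and `Score` are assumptions made for this sketch.

```python
import pandas as pd

scores = {
    "Strongly Disagree": 1,
    "Somewhat Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Somewhat Agree": 4,
    "Strongly Agree": 5,
}

survey = pd.DataFrame({
    "ID": [1, 2],
    "F1": ["Strongly Agree", "Somewhat Agree"],
    "F2": ["Somewhat Disagree", "Neither Agree nor Disagree"],
})

# (A) Stack: one row per (ID, question) pair.
stacked = survey.melt(id_vars=["ID"], var_name="Question", value_name="Response")

# (B) Transform: a single map call recodes every response at once.
stacked["Score"] = stacked["Response"].map(scores).astype("Int64")

# (C) Unstack: back to one row per ID and one column per question.
unstacked = (
    stacked
    .pivot(index="ID", columns="Question", values="Score")
    .reset_index()
)
print(unstacked)
```

The transformation in step (B) is written once, no matter how many question columns the survey has.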

## Part 2 – Performing data preparation

Dr. Bergen, Director of the Statistical Consulting Center at WSU, needs
you to prepare the attached data for analysis. The file
**health_survey.csv** contains the responses to a series of
health-related questions, and we need to recode the responses as 1-5
using the definitions below. The clients consider “Strongly Agree” the
best answer for most of the questions and want it mapped to 5, but for a
handful of questions they want the mapping reversed because “Strongly
Disagree” is the preferred answer. The two codings are illustrated in
the table below, and the list of questions that should receive the
reverse coding is available in the file **ReverseCodingItems.csv**.

| Old Label                    | Regular Coded Value | Reverse Coded Value |
|------------------------------|---------------------|---------------------|
| “Strongly Disagree”          | 1                   | 5                   |
| “Somewhat Disagree”          | 2                   | 4                   |
| “Neither Agree nor Disagree” | 3                   | 3                   |
| “Somewhat Agree”             | 4                   | 2                   |
| “Strongly Agree”             | 5                   | 1                   |

**Note.** I have prototyped this process in JMP and have provided screenshots of the resulting tables as a guide.

1.  Read in the `ReverseCodingItems.csv` file.  Note that the names in the `"Column Name"` column contain `"."`s.  Fix this, then look at the questions that need reverse coding and explain why it makes sense to reverse the coding on these items.

<font color="blue"> *Your thoughts here* </font>

2. Next, you will perform the data preparation by completing each of
    the tasks listed below.  First, *Stack* the response columns.

<img src="img/media/image1.png" style="width:2.91924in;height:1.85212in" alt="../../../../Desktop/Screen%20Shot%202018-03-22%20at%201.30.46%20" />

Work in a pipe and add a temporary `head` at the end.

In [6]:
# Your code here

3.  Make a new column called *Needs Reverse* by joining on
    the `"Yes"` or `"No"` values from the *Needs Reverse Coding?* column of **ReverseCodingItems.csv**.

> <img src="img/media/image2.png"
> style="width:3.78315in;"
> alt="../../../../Desktop/Screen%20Shot%202018-03-22%20at%201.35.26%20" />

Start by copying the pipe from the previous cell and adding to it. In practice, you would continue to work in the same cell, but here we are illustrating a standard best practice for working with data: continually cycling through the following steps.

1. Add the next step.
2. Rerun your code to test.

I want to verify that you are following this practice, which is why we will copy our previous code forward at each step.

In [1]:
# Copy your last pipe and edit the code here
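As an illustration of the join, here is a small pandas sketch using `merge`. The stacked rows and the lookup table are invented, including which item is flagged `"Yes"`; only the column names `Column Name` and `Needs Reverse Coding?` come from the description above.

```python
import pandas as pd

# Toy stacked data: one row per (ID, question) pair.
stacked = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Question": ["F1", "F2_1", "F1", "F2_1"],
    "Response": ["Strongly Agree", "Somewhat Disagree",
                 "Somewhat Agree", "Strongly Disagree"],
})

# Hypothetical lookup table; which items need reversing is made up here.
reverse_items = pd.DataFrame({
    "Column Name": ["F1", "F2_1"],
    "Needs Reverse Coding?": ["No", "Yes"],
})

# A left join keeps every stacked row and attaches its Yes/No flag.
joined = (
    stacked
    .merge(reverse_items, how="left", left_on="Question", right_on="Column Name")
    .drop(columns="Column Name")
    .rename(columns={"Needs Reverse Coding?": "Needs Reverse"})
)
print(joined)
```

Note that the join only matches if the question names agree on both sides, which is why the `"."`s in `ReverseCodingItems.csv` need fixing first.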

4.   We already created a `dict` for the regular coding.  Make a `dict` for the reverse coding using a `dict` comprehension.  Remember to use the `items` method and two names to iterate through the original `dict`.  Use subtraction.

In [1]:
# Copy your last pipe and edit the code here
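The comprehension described in this step can be sketched as follows. The name `regular` is an assumption for whatever you called your original `dict`.

```python
regular = {
    "Strongly Disagree": 1,
    "Somewhat Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Somewhat Agree": 4,
    "Strongly Agree": 5,
}

# Iterate over (label, score) pairs; on a 1-5 scale the reversed score
# is 6 minus the regular score, so 1 <-> 5, 2 <-> 4, and 3 stays 3.
reverse = {label: 6 - score for label, score in regular.items()}

for label in regular:
    print(label, regular[label], reverse[label])
```

Deriving `reverse` from `regular` means the two codings cannot drift apart if a label is ever edited.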

5.   Next we need to make a column with the question type.  For example, all questions that start with `F1`, like `F1` and `F1_1`, need to be coded as `F1`.  Add a `mutate` to create this column using one of the string transformations (extract, split, etc.) from this module.

> <img src="img/media/image6.png"
> style="width:2.97769in;"
> alt="../../../../Desktop/Screen%20Shot%202018-03-22%20at%201.45.03%20" />

In [1]:
# Copy your last pipe and edit the code here
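One possible string transformation for this step, sketched with the pandas `str.extract` accessor on made-up question names (`F2_3` and `F6_10` are invented examples of the naming pattern):

```python
import pandas as pd

# Toy question names following the F<number> or F<number>_<number> pattern.
questions = pd.Series(["F1", "F1_1", "F2_3", "F6_10"])

# Capture the leading "F" plus its digits; expand=False returns a Series.
question_type = questions.str.extract(r"^(F\d+)", expand=False)

print(question_type.tolist())
```

Splitting on `"_"` and keeping the first piece would be another reasonable option under the same naming assumption.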

6.  Make a new column by *Recoding* the Question Types to *F1, F2, …,
    F6* based on the `Question Type`. **Hint:** You might want to use `dfply.ifelse`.

> <img src="img/media/image6.png"
> style="width:2.97769in;"
> alt="../../../../Desktop/Screen%20Shot%202018-03-22%20at%201.45.03%20" />

In [1]:
# Copy your last pipe and edit the code here

7.  *Aggregate* and *unstack.*

> <img src="img/media/image7.png"
> style="width:2.46321in;"
> alt="../../../../Desktop/Screen%20Shot%202018-03-22%20at%201.46.11%20" />
> <img src="img/media/image8.png"
> style="width:2.96893in;"
> alt="../../../../Desktop/Screen%20Shot%202018-03-22%20at%201.46.34%20" />

In [1]:
# Copy your last pipe and edit the code here
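The general shape of an aggregate-then-unstack step is sketched below in plain pandas. The activity does not say which summary statistic to use, so the `mean` here is purely a placeholder assumption (check the screenshots for the intended one), and the `ID`, `Question Type`, and `Score` columns are invented toy data.

```python
import pandas as pd

# Toy scored data: one row per (respondent, question).
scored = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Question Type": ["F1", "F1", "F2", "F1", "F1", "F2"],
    "Score": [5, 3, 2, 4, 4, 1],
})

summary = (
    scored
    .groupby(["ID", "Question Type"])["Score"]
    .mean()                      # ASSUMPTION: placeholder aggregation
    .unstack("Question Type")    # one column per question type
    .reset_index()
)
print(summary)
```

The result has one row per respondent and one column per question type, which is the layout a final analysis table usually needs.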

8.  Save the table to a csv file in the `data` folder.

In [1]:
# Your code here

**Deliverables.** Submit this document with your answer to question 1,
the JMP file containing the results of Part 1, and a csv file with your
final table.