# Analyzing Survey Data

There are many tools and methods for analyzing survey data.  Most survey tools have built-in analysis capabilities that work to varying degrees.  Qualtrics has spent a lot of effort in this area because it is a major pain point.

Regardless of this effort, many people still follow a manual process that involves a variety of tools.  The net result of most analyses is a PowerPoint deck with a summary of the results, commentary, and recommendations.

To start closing the loop, and since we don't have the data in any specific system, we will use Python to analyze the survey data.  This gives us one more chance to explore how Python can be used in data science.

`Note: It is possible to analyze survey data in Excel, and many people are successful at doing that.  The way Excel works makes it difficult, however.  As a general-purpose graphical tool, it makes some things easier but a lot of things much harder.`

## Analysis Steps

The basic steps that you will want to go through when analyzing data are:

1. Download the data and convert to a format appropriate for the software you will be using - We want a .csv file for Python
2. Read in the data file and manually check to make sure that the process is working as expected
3. Recode the data - This is especially important for missing and N/A data
4. Clean the data - Generally removing entire records when the responses of the respondents are questionable
5. (Optional) Weight Data - This step greatly complicates the further analysis so it is often skipped in cases where the data is approximately representative
6. Run Cross Tabulations (crosstabs) to summarize the data and compare and contrast questions - many people have a script run all possible crosstabs
7. Prepare visualizations to explore data and start to plan for the visualization necessary in the final report
8. Perform any necessary statistical testing (This step is often not performed.  If you are stat testing crosstabs, the appropriate test is a chi-squared test)
9. Prepare presentation - This might be in collaboration with the internal or external client, but the analyst will almost always be involved in the process

## Download data

This step will depend heavily on the software system you are using.  If you decided to go with a third-party survey programming and hosting service, the third party will handle this for you.  This is what happened for our survey.

When you get the survey data from a third party make sure you get a copy of the survey as it was programmed.  It is common that there are last minute changes to the questions and wording.  Unless you are militant about maintaining the survey document it will not match the programmed survey and you will not know what you are analyzing.

Many survey programs will provide a printout of the survey designed for the analysis stage.  This can be a helpful backup if you were not faithful about maintaining the original survey document.

## Read in data file

This should be straightforward, as it is like reading any other data file.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [8]:
# Read in data
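
As a minimal sketch of this step, the cell below reads a tiny made-up CSV from memory and runs the manual checks.  The column names (`resp_id`, `q1_satisfaction`, `q2_region`, `duration_sec`) and the values are hypothetical stand-ins, not our actual survey.  With a real export you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported survey file (not the real data).
# With a real export, pass the path of the .csv file to pd.read_csv.
raw_csv = io.StringIO(
    "resp_id,q1_satisfaction,q2_region,duration_sec\n"
    "101,4,1,312\n"
    "102,5,2,298\n"
    "103,,1,45\n"
    "104,3,3,401\n"
)

survey = pd.read_csv(raw_csv)

# Manual checks: number of rows and columns, column types, first few rows
print(survey.shape)
print(survey.dtypes)
print(survey.head())

# Count missing values per column (the blank q1 answer shows up here)
print(survey.isna().sum())
```

If the shape, the column types, or the first rows do not match what you expect from the programmed survey, sort that out before moving on.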

## Code Data

The purpose of coding the data is to put it into a format that is easy for you as the analyst to understand.

The first step is to make sure that the missing and N/A values are coded correctly.  In .csv files the standard is to just leave the value blank.  This signals to the software that this is a missing value.  Other packages, however, will use a `.`, `-`, or `NA`.  You want to make sure that missing values end up in a format that you can actually use.

The second step is to translate categorical values stored as numbers to labels that are meaningful for analysis.  It gets really annoying to constantly be looking up, "What does 3 mean for this question?"  You will want to supply the substitutions up front.

In Python you will want to use the Pandas categorical data type for this. (https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)

In [None]:
# Code Data
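
One way both coding steps might look in pandas is sketched below.  The data, the column names, and the label dictionaries are all made up for illustration; the real labels have to come from the programmed survey.  The sketch assumes the export marks missing values with `.`, `-`, or `NA`.

```python
import io

import pandas as pd

# Hypothetical export that mixes several missing-value markers
raw_csv = io.StringIO(
    "resp_id,q1_satisfaction,q2_region\n"
    "101,4,1\n"
    "102,.,2\n"
    "103,NA,1\n"
    "104,-,3\n"
    "105,2,2\n"
)

# Step 1: tell pandas which markers mean "missing"
survey = pd.read_csv(raw_csv, na_values=[".", "-", "NA"])

# Step 2: supply the number-to-label substitutions up front (assumed labels)
region_labels = {1: "North", 2: "South", 3: "West"}
satisfaction_labels = {
    1: "Very dissatisfied",
    2: "Dissatisfied",
    3: "Neutral",
    4: "Satisfied",
    5: "Very satisfied",
}

# Unordered categorical for a nominal question
survey["q2_region"] = pd.Categorical(
    survey["q2_region"].map(region_labels),
    categories=list(region_labels.values()),
)

# Ordered categorical for a rating scale, so sorting follows the scale
survey["q1_satisfaction"] = pd.Categorical(
    survey["q1_satisfaction"].map(satisfaction_labels),
    categories=list(satisfaction_labels.values()),
    ordered=True,
)

print(survey)
```

Listing every category explicitly keeps unused answer options (such as "Neutral" here) in later tables instead of silently dropping them.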

## Clean the data

Cleaning the data usually means ensuring that all the respondents included in your survey were attentive and provided answers that were reasonable.  This is a hard step because it requires making judgements about who to remove and who not to remove.  There is the potential for introducing bias into your data if you are too aggressive about cleaning respondents out.  If there is a problem with a respondent, you will almost always remove their entire record.

__Always do this on a copy of your data file, not on the original!__ You want to be able to 1) go back to the original file and 2) document the steps that you took in case you need to justify or revisit the cleaning procedure.  I like to do the cleaning in code for this reason.  Code is always complete and tells the truth.

When cleaning data there are three major categories of bad responses that you will want to look at:

1. Check for missing data
2. Check for straightliners/Christmas tree responses
3. Check for speeders

In [9]:
# Clean Data
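
The three checks above can be sketched as follows.  Everything here is illustrative: the rating-grid columns (`g1` to `g4`), the `duration_sec` column, and both cutoffs (more than half the grid missing, faster than one third of the median time) are assumptions that you would need to set and justify for your own survey.  The straightliner check only catches identical answers; Christmas tree patterns need a separate rule.

```python
import pandas as pd

# Hypothetical responses: a four-item rating grid plus completion time
survey = pd.DataFrame({
    "resp_id": [101, 102, 103, 104, 105],
    "g1": [4, 3, 5, None, 2],
    "g2": [5, 3, 4, None, 2],
    "g3": [3, 3, 4, None, 1],
    "g4": [4, 3, 5, 2, 3],
    "duration_sec": [310, 295, 40, 280, 330],
})
grid_cols = ["g1", "g2", "g3", "g4"]

# Always work on a copy so the original stays untouched
cleaned = survey.copy()

# 1. Missing data: more than half of the grid unanswered (assumed cutoff)
too_many_missing = cleaned[grid_cols].isna().mean(axis=1) > 0.5

# 2. Straightliners: every grid item answered, all with the same value
all_answered = cleaned[grid_cols].notna().all(axis=1)
straightliner = all_answered & (cleaned[grid_cols].nunique(axis=1) == 1)

# 3. Speeders: faster than one third of the median time (assumed cutoff)
speeder = cleaned["duration_sec"] < cleaned["duration_sec"].median() / 3

# Keep the flags so the cleaning decisions are documented
flags = pd.DataFrame({
    "too_many_missing": too_many_missing,
    "straightliner": straightliner,
    "speeder": speeder,
})
print(flags.sum())

# Remove the entire record of any flagged respondent
cleaned = cleaned.loc[~flags.any(axis=1)]
print(cleaned)
```

Printing the flag counts before dropping anyone gives you a quick sense of whether a rule is too aggressive.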

## Weight the data

This topic is too complicated to cover in this class.  It is best to avoid the issue by ensuring the sample is properly balanced using quotas and careful sample selection.  If you are in a situation where you need to weight your data it is best to have professional help from somebody that has experience in this area.  

When you weight the data you will be influencing all the subsequent analysis and will need to account for that weighting especially if you do any statistical testing.

## Crosstabs

Crosstabs is a contraction of crosstabulations and refers to counting up the responses in a set of categories.  Strictly it refers to comparing two separate questions or variables, but in practice it refers to any counting summary.  You will often want to prepare crosstabs for all your questions and a broad selection of question pairs.  This is the primary output that you will analyze to compare and contrast the results.

If you have properly coded your data, this step is very easy to handle because there are built-in tools to create the crosstabs.

In [10]:
# Compute Cross tabs
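
A short sketch of the built-in tools, using made-up, already coded data.  The `region` and `satisfaction` columns are hypothetical.  `value_counts` gives the single-question summary, and `pd.crosstab` compares two questions.

```python
import pandas as pd

# Hypothetical coded responses
survey = pd.DataFrame({
    "region": ["North", "South", "North", "West", "South", "North"],
    "satisfaction": [
        "Satisfied", "Neutral", "Satisfied",
        "Dissatisfied", "Satisfied", "Neutral",
    ],
})

# One question: simple counts per answer
counts = survey["satisfaction"].value_counts()
print(counts)

# Two questions: counts with row and column totals
table = pd.crosstab(survey["region"], survey["satisfaction"], margins=True)
print(table)

# Same comparison as row proportions (each region sums to 1)
row_pct = pd.crosstab(
    survey["region"], survey["satisfaction"], normalize="index"
)
print(row_pct.round(2))
```

Row proportions are usually easier to compare across groups of different sizes, while the counts tell you how much data sits behind each proportion.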

## Exploratory Visualizations

When you create visualizations at this step you are primarily using them to explore the data rather than preparing presentation quality reports.

The primary visualizations you will use are:
- Bar Charts
- Histograms
- Occasionally scatter plots

You generally will not be creating line charts because you don't have data over time in a survey.

Pay attention to the difference between proportions and raw counts.  Each one has its uses, but you need to be aware and intentional about which one you use.


In [11]:
# Create visualizations
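
A rough sketch of quick exploratory charts, again with made-up data and hypothetical column names.  It draws the same question as raw counts and as proportions, next to a histogram of completion times.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical coded responses
survey = pd.DataFrame({
    "satisfaction": [
        "Satisfied", "Neutral", "Satisfied",
        "Dissatisfied", "Satisfied", "Neutral",
    ],
    "duration_sec": [310, 295, 120, 280, 330, 405],
})

# Raw counts versus proportions for the same question
counts = survey["satisfaction"].value_counts()
proportions = survey["satisfaction"].value_counts(normalize=True)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
counts.plot.bar(ax=axes[0], title="Satisfaction (counts)")
proportions.plot.bar(ax=axes[1], title="Satisfaction (proportions)")

# Histogram for a numeric variable
survey["duration_sec"].plot.hist(
    ax=axes[2], bins=5, title="Completion time (seconds)"
)
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script you would add `plt.show()`.  The two bar charts have the same shape, and only the axis tells you whether you are looking at respondents or shares.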

## Statistical Testing 

How much testing you do will depend on your organization, audience, and data.  This is something that we haven't covered heavily in this class, but it is worth investigating.  Most people that I work with outsource statistical testing to an expert because there are some subtle nuances that are difficult to keep track of if you don't do it regularly and don't have extensive training.
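
The analysis steps name the chi-squared test as the appropriate test for crosstabs.  As a hedged sketch, this is how that test might be run with `scipy.stats.chi2_contingency` on a crosstab of counts.  The data is invented, and the sketch assumes unweighted data with reasonably large cell counts; it does not replace the expert advice mentioned above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 40 respondents per region with different satisfaction
survey = pd.DataFrame({
    "region": ["North"] * 40 + ["South"] * 40,
    "satisfied": ["Yes"] * 30 + ["No"] * 10 + ["Yes"] * 18 + ["No"] * 22,
})

# The test runs on raw counts, not on proportions
table = pd.crosstab(survey["region"], survey["satisfied"])
print(table)

# Returns the test statistic, p-value, degrees of freedom,
# and the counts expected if the two questions were unrelated
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A small p-value suggests that satisfaction differs by region in this made-up sample.  Checking the `expected` counts is worthwhile, since very small expected cells make the test unreliable.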

## Prepare final report

This is something that you are all probably better at than I am since you get extensive practice doing this in the MBA program.