# Analyzing Survey Data: Understanding the Growth of Finnish Companies with SQL & Python

## Survey Data on Finnish Companies' Growth Perception

This dataset presents insights gathered from a survey focusing on the growth trajectory of Finnish companies. Top managers provided their perceptions regarding growth, innovativeness, and the capacity for renewal within their respective organizations.

Source: [Suominen & Pihlajamaa, 2022](https://www.sciencedirect.com/science/article/pii/S2352340922005261)
Dataset: Access the dataset [here](https://zenodo.org/records/5820394#.Y5OKl-zMK3I)
The survey data is stored in a CSV file named "survey_data.csv".

## Data Dictionary
The dataset comprises the following key columns:

- Growth_Firm: Indicates whether the company (firm) is currently classified as a growth company according to OECD definitions.
- question_2_row_1_transformed: Responses to question 2, part 1, with certain transformations applied beforehand.
- question_2_row_2_transformed: Responses to question 2, part 2, with certain transformations applied beforehand.
- question_3_row_1: Responses to question 3, part 1.
  ...
- question_7_row_1: Responses to question 7, part 1.


It's important to note that the dataset doesn't include the actual questions posed during the survey. However, a comprehensive description of each question is available in survey_questions.csv. We'll delve into the specifics of these questions as we address them throughout our analysis.

In [47]:
# Import plotly.express using the alias px
import plotly.express as px

# From scipy.stats import the mannwhitneyu function
from scipy.stats import mannwhitneyu

## Disclosure:
The file uses European-style CSV conventions (semicolons as delimiters and commas as decimal separators), so the default CSV reading settings do not parse it correctly. The workspace offers a solution: you can import data directly from a CSV file into a SQL query using DuckDB's read_csv_auto() function within the FROM clause.

In [48]:
-- Select everything from survey_data.csv
SELECT *
FROM 'survey_data.csv'

Unnamed: 0,Growth_Firm,question_2_row_1_transformed,question_2_row_2_transformed,question_3_row_1,question_3_row_2,question_3_row_3,question_3_row_4,question_3_row_5,question_3_row_6,question_3_row_7,question_3_row_8,question_3_row_9,question_3_row_10,question_3_row_11,question_3_row_12,question_3_row_13,question_3_row_14,question_3_row_15,question_3_row_16,question_4_row_1,question_4_row_2,question_4_row_3,question_4_row_4,question_5_row_1,question_5_row_2,question_5_row_3,question_5_row_4,question_5_row_5,question_5_row_6,question_5_row_7,question_5_row_8,question_5_row_9,question_5_row_10,question_6_row_1,question_6_row_2,question_7_row_1
0,0,351351351351351,507509391319659,4,5,5,4,3,3,4,4,4,2,2,2,2,4,4,3,4,4,4,4,1,1,2,4,2,4,2,3,2,5,4,5,1
1,0,230180426462548,51182200341316,5,4,4,4,4,4,4,5,5,4,2,4,2,4,4,3,4,3,3,4,4,4,2,3,4,3,3,3,4,3,5,4,1
2,0,866404715127701,629326385264931,3,4,4,4,4,3,4,5,3,3,3,5,3,4,4,4,4,4,4,4,4,4,4,5,4,4,4,4,,,5,3,1
3,0,176470588235294,391304347826087,3,4,5,4,4,4,5,5,3,3,4,5,4,4,5,3,4,3,3,3,3,2,3,3,3,4,4,4,3,3,3,3,1
4,0,60,328021248339973,4,4,4,4,3,4,4,4,5,5,2,3,1,2,4,2,4,2,2,2,2,2,2,4,2,4,2,3,3,4,5,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,1,227868852459016,141745068285281,3,4,4,3,2,4,3,3,3,3,4,4,4,3,3,3,3,2,2,2,2,2,2,2,3,3,3,4,3,3,3,4,2
116,1,316666666666667,446149645002731,5,5,5,4,4,5,5,4,5,5,5,4,3,4,1,4,3,2,2,1,2,3,2,4,4,2,2,2,3,3,2,2,2
117,1,566666666666667,499683995922528,4,5,5,4,4,4,5,5,3,3,5,4,5,5,5,4,4,5,4,5,5,5,5,5,4,4,4,5,4,5,4,4,2
118,1,471428571428571,465770862800566,4,5,4,5,5,4,4,4,4,4,5,5,2,5,5,5,2,2,2,2,5,5,5,5,5,4,2,5,5,5,1,1,2


Questions 2.1 and 2.2 hold numeric data, but the values above are clearly wrong. These columns use a comma as the decimal separator instead of a point, so with the default settings the values come through as very large integers (for example, 351351351351351 where 35.135135 is expected). I will fix this in the next query by stating the delimiter and decimal separator explicitly.

The data also contains blank cells, so I will use the nullstr option to have them read as NULL.

In [49]:
-- Select everything from survey_data.csv
SELECT *
FROM read_csv_auto('survey_data.csv', delim=';', decimal_separator=',', nullstr=' ')

Unnamed: 0,Growth_Firm,question_2_row_1_transformed,question_2_row_2_transformed,question_3_row_1,question_3_row_2,question_3_row_3,question_3_row_4,question_3_row_5,question_3_row_6,question_3_row_7,question_3_row_8,question_3_row_9,question_3_row_10,question_3_row_11,question_3_row_12,question_3_row_13,question_3_row_14,question_3_row_15,question_3_row_16,question_4_row_1,question_4_row_2,question_4_row_3,question_4_row_4,question_5_row_1,question_5_row_2,question_5_row_3,question_5_row_4,question_5_row_5,question_5_row_6,question_5_row_7,question_5_row_8,question_5_row_9,question_5_row_10,question_6_row_1,question_6_row_2,question_7_row_1
0,0,35.135135,50.750939,4,5,5,4,3,3,4,4,4,2,2,2,2,4,4,3,4,4,4,4,1,1,2,4,2,4,2,3,2.0,5.0,4,5,1
1,0,23.018043,51.182200,5,4,4,4,4,4,4,5,5,4,2,4,2,4,4,3,4,3,3,4,4,4,2,3,4,3,3,3,4.0,3.0,5,4,1
2,0,86.640472,62.932639,3,4,4,4,4,3,4,5,3,3,3,5,3,4,4,4,4,4,4,4,4,4,4,5,4,4,4,4,,,5,3,1
3,0,17.647059,39.130435,3,4,5,4,4,4,5,5,3,3,4,5,4,4,5,3,4,3,3,3,3,2,3,3,3,4,4,4,3.0,3.0,3,3,1
4,0,60.000000,32.802125,4,4,4,4,3,4,4,4,5,5,2,3,1,2,4,2,4,2,2,2,2,2,2,4,2,4,2,3,3.0,4.0,5,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,1,227.868852,1417.450683,3,4,4,3,2,4,3,3,3,3,4,4,4,3,3,3,3,2,2,2,2,2,2,2,3,3,3,4,3.0,3.0,3,4,2
116,1,316.666667,446.149645,5,5,5,4,4,5,5,4,5,5,5,4,3,4,1,4,3,2,2,1,2,3,2,4,4,2,2,2,3.0,3.0,2,2,2
117,1,566.666667,4996.839959,4,5,5,4,4,4,5,5,3,3,5,4,5,5,5,4,4,5,4,5,5,5,5,5,4,4,4,5,4.0,5.0,4,4,2
118,1,471.428571,465.770863,4,5,4,5,5,4,4,4,4,4,5,5,2,5,5,5,2,2,2,2,5,5,5,5,5,4,2,5,5.0,5.0,1,1,2
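Outside the workspace, the same cleaning can be sketched in pandas. This is a minimal illustration, not the notebook's own method: the three-row excerpt is hand-written in the assumed layout of survey_data.csv (semicolon-delimited, decimal commas, a single space for a missing value), and `sep`, `decimal`, and `na_values` play the roles of `delim`, `decimal_separator`, and `nullstr` in the DuckDB query above.

```python
import io

import pandas as pd

# Illustrative excerpt in the assumed layout of survey_data.csv:
# semicolon-delimited, decimal commas, and a lone space for a missing value.
raw = (
    "Growth_Firm;question_2_row_1_transformed;question_5_row_9\n"
    "0;35,135135;2\n"
    "0;86,640472; \n"
    "1;227,868852;3\n"
)

survey = pd.read_csv(
    io.StringIO(raw),  # replace with the path to survey_data.csv
    sep=";",           # field delimiter
    decimal=",",       # decimal comma instead of decimal point
    na_values=[" "],   # treat a lone space as missing, like nullstr=' '
)

print(survey.dtypes)
print(survey)
```

Reading the file this way also yields a dataframe that can be named `survey`, which is the name the plotting cells later in the notebook expect.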


Next, I aim to gain insight into the survey questions. To achieve this, I will import the data dictionary containing details about the survey questions.

In [1]:
SELECT *
FROM 'survey_questions.csv'

Unnamed: 0,column,question,row,section,title,response_type
0,question_2_row_1_transformed,2,1,estimated growth,Expected employee count in five years (as a pe...,numeric
1,question_2_row_2_transformed,2,2,estimated growth,Expected revenue in five years (as a percent f...,numeric
2,question_3_row_1,3,1,company culture,Employees are encouraged to be creative,agree_disagree
3,question_3_row_2,3,2,company culture,Managers are expected to be creative problem s...,agree_disagree
4,question_3_row_3,3,3,company culture,Employees' ability to function creatively is r...,agree_disagree
5,question_3_row_4,3,4,company culture,We are constantly looking for ways to develop ...,agree_disagree
6,question_3_row_5,3,5,company culture,Assistance in developing new ideas is readily ...,agree_disagree
7,question_3_row_6,3,6,company culture,Our organization is open and responsive to cha...,agree_disagree
8,question_3_row_7,3,7,company culture,"Managers here are always searching for fresh, ...",agree_disagree
9,question_3_row_8,3,8,company culture,Our organization has a clear and inspiring set...,agree_disagree


## Analyzing Question 2.1
Question 2 asks:
> If the firm develops the way you would like it to, how much revenue would the firm receive, and how many employees would it have five years ahead? Disregard possible inflation.

To initiate our analysis, I will visualize the first aspect of the question, pertaining to employee count. Given that the response type for this question is numeric, a histogram provides a suitable visualization method.

In [51]:
# `survey` is assumed to be the dataframe saved from the cleaned query above
px.histogram(
    survey,
    x="question_2_row_1_transformed",
    labels={
        "question_2_row_1_transformed": "Expected employee count in five years (as a percent from last available year)"
    }
)


An intriguing question arises: do companies currently classified as "growth" have different expectations about the number of additional employees they plan to hire over the next five years than "non-growth" companies? To explore this, I will create a histogram for each category by faceting the initial plot on Growth_Firm.

It's worth noting that in our dataset, Growth_Firm=0 represents "non-growing companies," while Growth_Firm=1 signifies "growing companies."

In [52]:
px.histogram(
    survey, 
    x="question_2_row_1_transformed",
    labels={
        "question_2_row_1_transformed": "Expected employee count in five years (as a percent from last available year)"},
    facet_row= "Growth_Firm"
)


## Visualizing Question 2.2

In [53]:
import plotly.express as px

px.histogram(
    survey, 
    x="question_2_row_2_transformed",
    labels={
        "question_2_row_2_transformed": "Expected revenue in five years (as a percent from last available year)"},
    facet_row="Growth_Firm",
    range_x=[0, 6000]
)

## Calculating Statistical Significance

The histograms for question 2.1 look similar across the two groups, but there may still be a statistically significant difference between them.

Because the data is right-skewed, a t-test isn't suitable. Instead, I will use a Mann-Whitney U test (also known as the Wilcoxon rank-sum test) to compare the two groups.
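A quick way to sanity-check the skew claim is to compare the mean with the median: a mean well above the median points to a long right tail. The sketch below uses only the ten non-growth values visible in the Growth_Firm = 0 query output in this section, so it is indicative rather than a result for the full dataset.

```python
from statistics import mean, median

# Non-growth responses to question 2.1 that are visible in this section's
# query output (an excerpt only, not the full column).
non_growth_excerpt = [
    35.135135, 23.018043, 86.640472, 17.647059, 60.0,
    -1.295497, 12.275449, 66.666667, 9.375, 506.060606,
]

avg = mean(non_growth_excerpt)
mid = median(non_growth_excerpt)

# A single very large value (506.06) pulls the mean far above the median.
print(f"mean = {avg:.2f}, median = {mid:.2f}")
```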

In [54]:
SELECT question_2_row_1_transformed
FROM read_csv_auto('survey_data.csv', delim=';', decimal_separator=',', nullstr=' ')
WHERE Growth_Firm = 0

Unnamed: 0,question_2_row_1_transformed
0,35.135135
1,23.018043
2,86.640472
3,17.647059
4,60.0
5,-1.295497
6,12.275449
7,66.666667
8,9.375
9,506.060606


In [55]:
SELECT question_2_row_1_transformed
FROM read_csv_auto('survey_data.csv', delim=';', decimal_separator=',', nullstr=' ')
WHERE Growth_Firm = 1

Unnamed: 0,question_2_row_1_transformed
0,580.272109
1,166.666667
2,400.000000
3,7.296137
4,25.000000
...,...
57,227.868852
58,316.666667
59,566.666667
60,471.428571


In [56]:
# The two previous queries were saved in the DataCamp workspace as the dataframes q2_1_non_growth and q2_1_growth
# Perform Mann-Whitney U test on q2_1_non_growth and q2_1_growth
mannwhitneyu(q2_1_non_growth, q2_1_growth)

MannwhitneyuResult(statistic=array([1299.]), pvalue=array([0.00884359]))

The Mann-Whitney U test returned a statistic of 1299 and a p-value of about 0.0088. Because the p-value is below the conventional significance threshold of 0.05, we reject the null hypothesis that the two groups share the same distribution: growth and non-growth companies differ significantly in their expected employee growth.
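The array-wrapped values in the result above presumably come from passing single-column dataframes to the function. A minimal sketch of the same call on one-dimensional input is shown below; it uses only the values visible in the two query outputs (10 non-growth and 9 growth firms), so its statistic and p-value will not match the full-data result reported above.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Values visible in the query outputs above (excerpts, not the full columns).
non_growth = np.array([
    35.135135, 23.018043, 86.640472, 17.647059, 60.0,
    -1.295497, 12.275449, 66.666667, 9.375, 506.060606,
])
growth = np.array([
    580.272109, 166.666667, 400.0, 7.296137, 25.0,
    227.868852, 316.666667, 566.666667, 471.428571,
])

# With 1-D input the result holds plain scalars rather than arrays.
result = mannwhitneyu(non_growth, growth, alternative="two-sided")
u_stat, p_value = result.statistic, result.pvalue

print(f"U = {u_stat}, p = {p_value:.4f}")
```

In the notebook itself, selecting the column first (for example `q2_1_growth["question_2_row_1_transformed"]`) should give the same one-dimensional input, assuming the saved dataframes keep that column name.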

## Visualizing Likert scales (rating scales)
Numerous questions in the survey dataset feature categorical responses with five options ranging from "Strongly Disagree" to "Strongly Agree."

The values are represented by integers, with 1 indicating "Strongly Disagree" and 5 representing "Strongly Agree." To enhance the visualization of responses, it is preferable to utilize explicit labels rather than numerical values.

In [57]:
-- Import everything from agree_disagree.csv as lookup
SELECT *
FROM 'agree_disagree.csv'

Unnamed: 0,code,response
0,1,Strongly disagree
1,2,Disagree
2,3,Neither agree or disagree
3,4,Agree
4,5,Strongly agree


We want a count for each of the five responses, even those that never occur in the data, so zero counts must be allowed. A left join from the lookup table to the survey data achieves this: every response label is kept, and counting the unmatched survey column yields 0.

In [58]:
SELECT
    lookup.response,
    COUNT(survey.question_3_row_1) AS n,
    lookup.code - 3 AS agreement
FROM 'agree_disagree.csv' AS lookup
LEFT JOIN read_csv_auto('survey_data.csv', delim=';', decimal_separator=',', nullstr=' ') AS survey
    ON lookup.code = survey.question_3_row_1
GROUP BY lookup.response, lookup.code
ORDER BY lookup.code;

Unnamed: 0,response,n,agreement
0,Strongly disagree,0,-2
1,Disagree,6,-1
2,Neither agree or disagree,18,0
3,Agree,67,1
4,Strongly agree,29,2
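The zero-count behaviour of the left join can also be sketched in pandas. This is an illustration under stated assumptions rather than the notebook's method: the lookup table is copied from agree_disagree.csv as shown above, and the survey column holds only the nine question_3_row_1 values visible in the data preview, so the counts differ from the full-data output.

```python
import pandas as pd

# Lookup table as shown in agree_disagree.csv above.
lookup = pd.DataFrame({
    "code": [1, 2, 3, 4, 5],
    "response": [
        "Strongly disagree", "Disagree", "Neither agree or disagree",
        "Agree", "Strongly agree",
    ],
})

# question_3_row_1 values visible in the data preview (an excerpt only).
survey = pd.DataFrame({"question_3_row_1": [4, 5, 3, 3, 4, 3, 5, 4, 4]})

counts = (
    lookup.merge(survey, how="left", left_on="code", right_on="question_3_row_1")
    # count() ignores the NaN left by unmatched codes, so they get n = 0
    .groupby(["code", "response"], as_index=False)["question_3_row_1"]
    .count()
    .rename(columns={"question_3_row_1": "n"})
)
counts["agreement"] = counts["code"] - 3  # centre the scale on the neutral answer

print(counts)
```

Starting the join from the lookup table is the design choice that matters here: an inner join, or a plain count on the survey column alone, would silently drop the responses nobody selected.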


Categorical variables of this nature, featuring a neutral response and two opposing sets of responses (agreeing and disagreeing), are ideally visualized using a diverging color scale.

In [59]:
# The previous query was saved in the DataCamp workspace as a dataframe named q3_1_counts

px.bar(
    q3_1_counts,
    x="response",
    y="n",
    color="agreement",
    color_continuous_scale= px.colors.diverging.Portland_r
)

## How did the pandemic impact these companies?
Question 7 asks:
> Has the COVID-19 pandemic had a significant impact on your firm's actions related to the topics mentioned above during the previous year?

In [60]:
SELECT
    lookup.response,
    COUNT(survey.question_7_row_1) AS n,
    lookup.code - 3 AS agreement
FROM 'agree_disagree.csv' AS lookup
LEFT JOIN read_csv_auto('survey_data.csv', delim=';', decimal_separator=',', nullstr=' ') AS survey
    ON lookup.code = survey.question_7_row_1
GROUP BY lookup.response, lookup.code
ORDER BY lookup.code;

Unnamed: 0,response,n,agreement
0,Strongly disagree,54,-2
1,Disagree,66,-1
2,Neither agree or disagree,0,0
3,Agree,0,1
4,Strongly agree,0,2


In [61]:
px.bar(
    q7_1_counts,
    x="response",
    y="n",
    color="agreement",
    color_continuous_scale= px.colors.diverging.Portland_r
)
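The query output for question 7 contains only codes 1 and 2, which is unusual for a five-point agreement scale. Before reading the labels literally, it may be worth checking the response_type recorded for question_7_row_1 in survey_questions.csv; the preview of that file earlier in the notebook is truncated, so the value is not visible here. A minimal sketch of such a check follows, using a hand-built excerpt of the questions table and a hypothetical helper named `response_type_for`.

```python
import pandas as pd

# Two rows copied from the survey_questions.csv preview shown earlier;
# the full file is assumed to contain one row per survey column.
questions = pd.DataFrame({
    "column": ["question_2_row_1_transformed", "question_3_row_1"],
    "response_type": ["numeric", "agree_disagree"],
})


def response_type_for(questions, column):
    """Return the response_type for a survey column, or None if it is not listed."""
    match = questions.loc[questions["column"] == column, "response_type"]
    return match.iloc[0] if not match.empty else None


print(response_type_for(questions, "question_3_row_1"))
# Not present in this excerpt, so the helper returns None here.
print(response_type_for(questions, "question_7_row_1"))
```

Run against the full questions table, the helper would show whether the agree/disagree lookup is the right one to join against for this question.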