Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Connection reset by peer @ io_fillbuf - fd:158 #56

Open
aimdevelopment opened this issue Sep 22, 2022 · 0 comments
Open

Error Connection reset by peer @ io_fillbuf - fd:158 #56

aimdevelopment opened this issue Sep 22, 2022 · 0 comments

Comments

@aimdevelopment
Copy link

Hi, I wonder if you could please help as I'm encountering issues with this gem in a live project and I'm at a bit of a loss as to what is happening so if anyone can provide any help it would be greately appreciated!

Overview

The FTP integration is used by a number of old weather stations that connect via FTP, over a mobile network, and write 1 line of text every minute. I'm simply using an override for the write_file() function that looks at the filename and then passed the line to be written on to be written to the database. This was working just fine for months until recently when I started to see 'Error Connection reset by peer @ io_fillbuf - fd:{number}' errors. Once these start to appear in the logs, slowly all of the connections start to generate this error until the FTP server refuses to accept connections and the only way that seems to fix it so far is to restart the application, where it will run for a apparently random amount of time from 7 hours to over a day before this starts to happen again, requiring another restart.

System

  • Debian docker instance Debian GNU/Linux 9.13 (stretch)
  • ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
  • ftpd-2.1.0

What's unusual and did anything change?

  1. More weather stations were added but the total is only 28.
  2. These stations are using mobile network and are kind of slow. Mostly the session is < 1 second from connect to write and close but this can be up to 9 seconds from what I've seen.
  3. Apart from the weather stations connecting in, there's one connection from someone who is connecting in every 5 minutes to read the files for their own purposes, which is some legacy task from a 3rd part system.
  4. There is a miss-configured weather station that tries to connect every minute with the wrong credentials.
  5. There does not appear to be a consistent time frame between restart and the first instance of the error in the logs. Complete lockout (all connections refused) is almost daily though.

What are the settings?

It may be worth mentioning that this is running in a docker environment with mapped ports

The file system uses the default.

Initializers

require 'driver'

# Need to overwrite the write function so we can grab the data
Ftpd::DiskFileSystem.class_eval do
  
  def write_file(ftp_path, stream, mode)
    begin

      File.open(expand_ftp_path(ftp_path), mode) do |file|

        # Get the details from the path
        arr_acc_dh = file.path.gsub(FTP_ROOT, "").split("/")
        account_id = arr_acc_dh[1]
        datahog_id = arr_acc_dh[2]

        dh_importer = DataImporter.new()
        while line = stream.read

          # Ignore anything other than day data
          if ftp_path.include? "daydata.txt"
            dh_importer.import_data(line, account_id, datahog_id)
          elsif ftp_path.include? "rssidata.txt"
            dh_importer.import_ssri_data(line, account_id, datahog_id)
          end  
          file.write line
        end
      end
    
    rescue Exception => e

      # Something bad happened
      puts "ERROR : #{DateTime.now} : #{Thread.current.object_id} : write_file:  #{e.full_message} \n"      
    end
  end

end


# Override to set host and port 
Ftpd::Session.class_eval do

  def initialize_session
    @logged_in = false
    @data_type = 'A'
    @mode = 'S'
    @structure = 'F'
    @data_channel_protection_level = :clear
    @name_prefix = '/'
    # @data_hostname = nil
    # @data_port = nil
    @data_hostname = '{hostname}'
    @data_port = 2157 
    @protection_buffer_size_set = false
    @epsv_all = false
    close_data_server_socket
    reset_failed_auths
  end

end 

# Strange error on stat request on live server. Debug attempt
Ftpd::CmdStat.class_eval do

  def cmd_stat(argument)
    reply "220 sd/ftpd 2.1.0"
  end

end

Main execution block

# Set the log output and override to include thread info
skye_logger = Logger.new($stdout) #Logger.new('/var/log/ftpd.log')
skye_logger.formatter = proc do |severity, datetime, progname, msg|
   "#{severity} : #{datetime} : #{Thread.current.object_id} : #{msg}\n"
end

driver = Driver.new("/var/www/shared/ftp_storage" , skye_logger) # Pass logger to driver 
server = Ftpd::FtpServer.new(driver)
server.log = skye_logger
server.port = 2157
server.interface = '0.0.0.0'
server.nat_ip = '{External Server IP}'
server.passive_ports = Range.new(4021, 4060)
server.allow_low_data_ports = false
server.session_timeout = 45
server.failed_login_delay = 120
server.max_connections = 100
server.max_connections_per_ip = 1
server.tls = :off

server.start
puts "FTP Server listening on port #{server.bound_port}"
puts "FTP Server interface #{server.interface}"
begin
  gets
rescue Interrupt
  puts "Interrupt"
end`

Driver Code

require 'ftpd'
require "tmpdir"

class Driver

  def initialize(ftp_root, logger)
    @logger = logger
    @ftp_root = ftp_root
    @full_datahog_data_path = ''
    @logger.debug "passed ftp_root #{ftp_root}"
  end

  def authenticate(user, password)
    begin
      
      # Expect the username to be consisting of 2 numbers account_id and hog_id. 
      account_id = ''
      datahog_id = ''
      if user.split("_").count != 2
        raise "invalid login. Must consist of account_id and datahog_id e.g. '123_456'"
      end  
      
      account_id = user.split("_")[0].to_i
      datahog_id = user.split("_")[1].to_i

      datahog = Datahog.includes(:account).references(:account).where(id: datahog_id).where({account: { id: account_id}}).where({datahogs: { ftp_password: password}}).where(account: {active: true})
    
      if datahog.blank?
        @logger.info "Invalid FTP credentials or inactive account. No matching datahog found for user: #{user} and password: #{password}"
        false
      else
      
        # Create the account directory if it doesn't exist already
        ftp_root_with_account = "#{@ftp_root}/#{account_id}"      

        unless File.exist?(ftp_root_with_account)
          FileUtils.mkdir_p(ftp_root_with_account)
        end

        # Now the datahog directory
        self.instance_variable_set(:@full_datahog_path, "#{ftp_root_with_account}/#{datahog_id}")

        unless File.exist?(self.instance_variable_get(:@full_datahog_path))
          FileUtils.mkdir_p(self.instance_variable_get(:@full_datahog_path))
        end 

        self.instance_variable_set(:@full_datahog_data_path, "#{self.instance_variable_get(:@full_datahog_path)}/data")

        unless File.exist?(self.instance_variable_get(:@full_datahog_data_path))
         FileUtils.mkdir_p(self.instance_variable_get(:@full_datahog_data_path))
        end

        @logger.debug "User '#{user}'. instance var full_datahog_data_path : '#{self.instance_variable_get(:@full_datahog_data_path)}'"

        true  
      end 
      
    rescue Exception => ex
      @logger.error "FTP driver - Authentication failure #{ex.full_message}"
      false
    end
  end

  def file_system(user)
    Ftpd::DiskFileSystem.new(self.instance_variable_get(:@full_datahog_path))
  end

end

Modification to get full all thread stack trace

CommandLoop modified with this function

def read_and_execute_commands
      catch :done do
        begin
          reply "220 #{server_name_and_version}"
          loop do
            begin
              s = get_command
              s = process_telnet_sequences(s)
              syntax_error unless s =~ /^(\w+)(?: (.*))?$/
              command, argument = $1.downcase, $2
              unless valid_command?(command)
                error "Syntax error, command unrecognized: #{s.chomp}", 500
              end
              command_sequence_checker.check command
              execute_command command, argument
            rescue FtpServerError => e
              reply e.message_with_code
            rescue => e

              # Get the full back trace
              puts "Error thread #{Thread.current.object_id}. Local error in processing. Error #{e.message}"
              thread_count = 0
              err_msg = ""
              Thread.list.each do |t|
                thread_count += 1
                err_msg += "--- thread #{thread_count} of total #{Thread.list.size}. Thread.id #{t.object_id} backtrace begin \n"
                # Lets see if we are able to pin down the culprit
                # by collecting backtrace for all existing threads:
                err_msg += t.backtrace.join("\n")
                err_msg += "\n---thread #{thread_count} of total #{Thread.list.size}. Thread.id #{t.object_id} backtrace end \n"
              end
              puts "Full thread & object backtrace : - \n #{err_msg}"

              reply "451 Requested action aborted. Local error in processing. Error #{e.message}"
              config.exception_handler.call(e) unless config.exception_handler.nil?

            end
          end
        rescue Errno::ECONNRESET, Errno::EPIPE
        end
      end
    end

Working example weather station transfer

DEBUG : 2022-09-22 14:24:45 +0100 : 429400 : 220 SmartData_Skye 2.1.0
DEBUG : 2022-09-22 14:24:46 +0100 : 429400 : USER 19_51
DEBUG : 2022-09-22 14:24:46 +0100 : 429400 : 331 Password required
DEBUG : 2022-09-22 14:24:46 +0100 : 429400 : PASS **FILTERED**
DEBUG : 2022-09-22 14:24:46 +0100 : 429400 : User '19_51'. instance var full_datahog_data_path : '/var/www/shared/ftp_storage/19/51/data'
DEBUG : 2022-09-22 14:24:46 +0100 : 429400 : 230 Logged in
DEBUG : 2022-09-22 14:24:46 +0100 : 429400 : PASV
DEBUG : 2022-09-22 14:24:46 +0100 : 429400 : 227 Entering passive mode (212,110,172,113,15,234)
DEBUG : 2022-09-22 14:24:47 +0100 : 429400 : TYPE I
DEBUG : 2022-09-22 14:24:47 +0100 : 429400 : 200 Type set to I
DEBUG : 2022-09-22 14:24:47 +0100 : 429400 : CWD data
DEBUG : 2022-09-22 14:24:47 +0100 : 429400 : 250 "/data" is current directory
DEBUG : 2022-09-22 14:24:48 +0100 : 429400 : APPE 220922daydata.txt
DEBUG : 2022-09-22 14:24:48 +0100 : 429400 : 150 Opening BINARY mode data connection
DEBUG : 2022-09-22 14:24:51 +0100 : 429400 : import_data.account_id : 19 import_data.datahog_id : 51
DEBUG : 2022-09-22 14:24:51 +0100 : 429400 : 13:17:00  22.09.22   00  0425.28  01  018.69   02  004.462  03  00210    04  0000.00  05  01.4573  06  019.52   07  055.105
DEBUG : 2022-09-22 14:24:51 +0100 : 429400 : Date time found: '22.09.22 13:17:00'. Format to use '%d.%m.%y %H:%M:%S'
DEBUG : 2022-09-22 14:24:51 +0100 : 429400 : Received 126 bytes
DEBUG : 2022-09-22 14:24:51 +0100 : 429400 : 226 Transfer complete

Note: Above is the most common scenario in that I don't always see the extra quit line...

DEBUG : 2022-09-22 14:19:11 +0100 : 419900 : QUIT
DEBUG : 2022-09-22 14:19:11 +0100 : 419900 : 221 Byebye

...This is why I set the session timeout low in case connections aren't being closed.

Example of the odd one out
This connection is not made by the weather stations but by a 3rd party FTP request for their own purposes

DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 220 SmartData_Skye 2.1.0
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : USER 17_49
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 331 Password required
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : PASS **FILTERED**
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : User '17_49'. instance var full_datahog_data_path : '/var/www/shared/ftp_storage/17/49/data'
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 230 Logged in
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : FEAT
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 211-Extensions supported:
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 :  EPRT
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 :  EPSV
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 :  MDTM
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 :  SIZE
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 211 END
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : TYPE I
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 200 Type set to I
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : CWD data
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 250 "/data" is current directory
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : EPSV
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 229 Entering extended passive mode (|||4054|)
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : RETR 220922daydata.txt
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 150 Opening BINARY mode data connection
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : Sent 12789 bytes
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 226 Transfer complete
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : MODE S
DEBUG : 2022-09-22 14:36:07 +0100 : 446940 : 200 Mode set to Strea

Full stack trace on io_fillbuf error including all threads

Included log from the start of thread id 2753380 : -

20220922_full_stack_trace.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant